Abstract: We consider the problem of estimating the mean $f$ of a Gaussian vector $Y$ with independent components of common unknown variance $\sigma^2$. Our estimation procedure is based on estimator selection. More precisely, we start with an arbitrary and possibly infinite collection $\mathbb{F}$ of estimators of $f$ based on $Y$ and, with the same data $Y$, aim at selecting an estimator among $\mathbb{F}$ with the smallest Euclidean risk. No assumptions on the estimators are made and their dependencies with respect to $Y$ may be unknown. We establish a non-asymptotic risk bound for the selected estimator. As particular cases, our approach makes it possible to handle the problems of aggregation and model selection as well as those of choosing a window and a kernel for estimating a regression function, or tuning the parameter involved in a penalized criterion. We also derive oracle-type inequalities when $\mathbb{F}$ consists of linear estimators. For illustration, we carry out two simulation studies. One aims at comparing our procedure to cross-validation for choosing a tuning parameter. The other shows how to implement our approach to solve the problem of variable selection in practice.
1. Introduction

1.1. The setting and the approach

We consider the Gaussian regression framework

$$Y_i = f_i + \varepsilon_i, \quad i = 1, \dots, n,$$

where $f = (f_1, \dots, f_n)$ is an unknown vector of $\mathbb{R}^n$ and the $\varepsilon_i$ are independent centered Gaussian random variables with common variance $\sigma^2$. Throughout the paper, $\sigma^2$ is assumed to be unknown, which corresponds to the practical case. Our aim is to estimate $f$ from the observation of $Y$. For specific forms of $f$, this setting makes it possible to deal simultaneously with the following problems.

Example 1 (Signal denoising). The vector $f$ is of the form

$$f = (F(x_1), \dots, F(x_n))$$

where $x_1, \dots, x_n$ are distinct points of a set $\mathcal{X}$ and $F$ is an unknown mapping from $\mathcal{X}$ into $\mathbb{R}$.

Example 2 (Linear regression). The vector $f$ is assumed to be of the form

$$f = X\beta \tag{1}$$

where $X$ is an $n \times p$ matrix, $\beta$ is an unknown $p$-dimensional vector and $p$ some integer larger than 1 (and possibly larger than $n$). The columns of the matrix $X$ are usually called predictors. When $p$ is large, one may assume that the decomposition (1) is sparse in the sense that only a few $\beta_j$ are non-zero. Estimating $f$ or finding the predictors associated to the non-zero coordinates of $\beta$ are classical issues. The latter is called variable selection.

Our estimation strategy is based on estimator selection. More precisely, we start with an arbitrary collection $\mathbb{F} = \{\hat f_\lambda, \lambda \in \Lambda\}$ of estimators of $f$ based on $Y$ and aim at selecting the one with the smallest Euclidean risk by using the same observation $Y$. The way the estimators $\hat f_\lambda$ depend on $Y$ may be arbitrary and possibly unknown. For example, the $\hat f_\lambda$ may be obtained from the minimization of a criterion, a Bayesian procedure or the guess of some experts.

1.2. The motivation 

The problem of choosing some best estimator among a family of candidate 
ones is central in Statistics. Let us present some examples. 

Example 3 (Choosing a tuning parameter). Many statistical procedures depend on a (possibly multi-dimensional) parameter $\lambda$ that needs to be tuned in view of obtaining an estimator with the best possible performance. For example, in the context of linear regression as described in Example 2, the Lasso estimator (see Tibshirani (1996) and Chen et al. (1998)) defined by $\hat f_\lambda = X\hat\beta_\lambda$ with

$$\hat\beta_\lambda = \arg\min_{\beta \in \mathbb{R}^p} \Big\{ \|Y - X\beta\|^2 + \lambda \sum_{j=1}^p |\beta_j| \Big\}$$

depends on the choice of the parameter $\lambda > 0$. Selecting this parameter among a grid $\Lambda \subset \mathbb{R}_+$ amounts to selecting a (suitable) estimator among the family $\mathbb{F} = \{\hat f_\lambda, \lambda \in \Lambda\}$.
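
To make the passage from a grid of tuning parameters to a family of candidate estimators concrete, here is a minimal sketch. It assumes the Lasso criterion $\|Y - X\beta\|^2 + \lambda\sum_j|\beta_j|$ and solves it with a plain proximal-gradient (ISTA) loop written for this illustration; this is not the algorithm used in the paper, and the names `lasso_ista`, `lasso_family`, the simulated data and the grid are all hypothetical.

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=500):
    """Approximately minimise ||y - X b||^2 + lam * sum_j |b_j| (proximal gradient)."""
    # Step size = 1 / Lipschitz constant of the gradient of ||y - X b||^2.
    step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ b - y)
        z = b - step * grad
        # Soft-thresholding is the proximal operator of the l1 penalty.
        b = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return b

def lasso_family(X, y, grid):
    """Map each tuning parameter of the grid to the fitted vector X b_lam."""
    return {lam: X @ lasso_ista(X, y, lam) for lam in grid}

# Hypothetical simulated data: sparse linear regression with unit noise variance.
rng = np.random.default_rng(0)
n, p = 50, 10
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:2] = [3.0, -2.0]
y = X @ beta + rng.standard_normal(n)

# The last grid value is large enough for the zero vector to be a minimiser.
grid = [0.1, 1.0, 10.0, 2.0 * np.max(np.abs(X.T @ y))]
family = lasso_family(X, y, grid)
```

Each value of `family` is one candidate $\hat f_\lambda$; a selection rule only needs these fitted vectors, not the way they were computed.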

Another dilemma for Statisticians is the choice of a procedure to solve a 
given problem. In the context of Example 3, there exist many competitors 
to the Lasso estimator and one may alternatively choose a procedure based 
on ridge regression (see Hoerl and Kennard (1970)), random forest or PLS 
(see Tenenhaus (1998), Helland (2001) and Helland (2006)). Similarly, for the 
problem of signal denoising as described in Example 1, popular approaches 
include spline smoothing, wavelet decompositions and kernel estimators. The
choice of a kernel may be tricky.

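Example 4 (Choosing a kernel). Consider the problem described in Example 1 with $\mathcal{X} = \mathbb{R}$. For a kernel $K$ and a bandwidth $h > 0$, the Nadaraya-Watson estimator (see Nadaraya (1964) and Watson (1964)) $\hat f_{K,h} \in \mathbb{R}^n$ is defined as

$$\hat f_{K,h} = \big(\hat F_{K,h}(x_1), \dots, \hat F_{K,h}(x_n)\big)$$

where for $x \in \mathbb{R}$

$$\hat F_{K,h}(x) = \frac{\sum_{j=1}^n K\big(\frac{x - x_j}{h}\big)\, Y_j}{\sum_{j=1}^n K\big(\frac{x - x_j}{h}\big)}.$$

There exist many possible choices for the kernel $K$, such as the Gaussian kernel $K(x) = e^{-x^2/2}$, the uniform kernel $K(x) = \mathbf{1}_{|x|<1}$, etc. Given a (finite) family $\mathcal{K}$ of candidate kernels $K$ and a grid $\mathcal{H} \subset \mathbb{R}_+^*$ of possible values of $h$, one may consider the problem of selecting the best kernel estimator among the family $\mathbb{F} = \{\hat f_\lambda, \lambda = (K, h) \in \mathcal{K} \times \mathcal{H}\}$.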
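
The family indexed by $(K, h)$ can be sketched as follows. The estimator is the usual Nadaraya-Watson weighted average evaluated at the design points; the design, the noise level, the bandwidth grid and the function names are assumptions made for illustration only.

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-u ** 2 / 2.0)

def uniform_kernel(u):
    return (np.abs(u) < 1.0).astype(float)

def nadaraya_watson(x, y, kernel, h):
    """Fitted values at the design points x_1, ..., x_n."""
    # w[i, j] = K((x_i - x_j) / h); the diagonal equals K(0) = 1 for both
    # kernels above, so every row sum is positive.
    w = kernel((x[:, None] - x[None, :]) / h)
    return (w @ y) / w.sum(axis=1)

# Hypothetical data: a noisy sine curve on a regular grid of [0, 1].
x = np.linspace(0.0, 1.0, 40)
rng = np.random.default_rng(1)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(x.size)

kernels = {"gaussian": gaussian_kernel, "uniform": uniform_kernel}
bandwidths = [0.01, 0.05, 0.2]

# One candidate estimator per pair lambda = (K, h).
family = {(name, h): nadaraya_watson(x, y, K, h)
          for name, K in kernels.items() for h in bandwidths}
```

With the uniform kernel and a bandwidth smaller than the grid spacing, each point is its only neighbour and the estimator reproduces the data, which illustrates how strongly the risk depends on the pair $(K, h)$.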
1.3. A look at the literature

A common way to address the above issues is to use some cross-validation scheme such as leave-one-out or V-fold. Even though these resampling techniques are widely used in practice, little is known about their theoretical performance. For more details, we refer to Arlot and Celisse (2010) for a survey on cross-validation techniques applied to model selection. Compared to these approaches, as we shall see, the procedure we propose is less time consuming and easier to implement. Moreover, it does not require knowing how the estimators depend on the data $Y$ and we can therefore handle the following problem.

Example 5 (Selecting among mute experts). A Statistician is given a collection $\mathbb{F} = \{\hat f_\lambda, \lambda \in \Lambda\}$ of estimators from a family $\Lambda$ of experts $\lambda$, each of whom keeps secret the way his/her estimator $\hat f_\lambda$ depends on the observation $Y$. The problem is then to find which expert $\lambda$ is the closest to the truth.

Given a selection rule among $\mathbb{F}$, an important issue is to compare the risk of the selected estimator to those of the candidate ones. Results in this direction are available in the context of model selection, which can be seen as a particular case of estimator selection. More precisely, for the purpose of selecting a suitable model one starts with a collection $\mathbb{S}$ of those, typically linear spaces chosen for their approximation properties with respect to $f$, and one associates to each model $S \in \mathbb{S}$ a suitable estimator $\hat f_S$ with values in $S$. Selecting a model then amounts to selecting an estimator among the collection $\mathbb{F} = \{\hat f_S, S \in \mathbb{S}\}$. For this problem, selection rules based on the minimization of a penalized criterion have been proposed in the regression setting by Yang (1999), Baraud (2000), Birgé and Massart (2001) and Baraud et al (2009). Another way, usually called Lepski's method, appears in a series of papers by Lepski (1990; 1991; 1992a; 1992b) and was originally designed to perform model selection among collections of nested models. Finally, we mention that other procedures based on resampling have interestingly emerged from the work of Arlot (2007; 2009) and Celisse (2008). A common feature of those approaches lies in the fact that the proposed selection rules apply to specific collections of estimators only.

An alternative to estimator selection is aggregation which aims at designing a suitable combination of given estimators in order to outperform each of these separately (and even the best combination of these) up to a remaining term. Aggregation techniques can be found in Catoni (1997; 2004), Juditsky and Nemirovski (2000), Nemirovski (2000), Yang (2000a; 2000b; 2001), Tsybakov (2003), Wegkamp (2003), Birgé (2006), Rigollet and Tsybakov (2007), Bunea, Tsybakov and Wegkamp (2007) and Goldenshluger (2009) for $\mathbb{L}_p$-losses. Most of the aggregation procedures are based on sample splitting, one part of the data being used for building the estimators, the remaining part for selecting among these. Such a device requires that the observations be i.i.d. or at least that one has at one's disposal two independent copies of the data. From this point of view our procedure differs from classical aggregation procedures since we use the whole data $Y$ to build and select. In the Gaussian regression setting that is considered here, we mention the results of Leung and Barron (2006) for the problem of mixing least-squares estimators. Their procedure uses the same data $Y$ to estimate and to aggregate but requires the variance to be known. Giraud (2008) extends their results to the case where it is unknown.



1.4. What is new here?

Our approach for solving the problem of estimator selection is new. We introduce a collection $\mathbb{S}$ of linear subspaces of $\mathbb{R}^n$ for approximating the estimators in $\mathbb{F}$ and use a penalized criterion to compare them. As already mentioned and as we shall see, this approach requires no assumption on the family of estimators at hand and is easy to implement, an R-package being available on

http://w3.jouy.inra.fr/unites/miaj/public/perso/SylvieHuet_en.html.

A general way of comparing estimators in various statistical settings has been described in Baraud (2010). However, the procedure proposed there is mainly abstract and inadequate in the Gaussian framework we consider.

We prove a non-asymptotic risk bound for the estimator we select and show that this bound is optimal in the sense that it essentially cannot be improved (except for numerical constants maybe) by any other selection rule. For the sake of illustration and comparison, we apply our procedure to various problems among which aggregation, model selection, variable selection and selection among linear estimators. In each of these cases, our approach allows us to recover classical results in the area as well as to establish new ones. In the context of aggregation we compute the aggregation rates for the unknown variance case. These rates turn out to be the same as those for the known variance case. For selecting an estimator among a family of linear ones, we propose a new procedure and establish a risk bound which requires almost no assumption on the considered family. Finally, our approach provides a way of selecting a suitable variable selection procedure among a family of candidate ones. It thus provides an alternative to cross-validation for which little is known.

The paper is organized as follows. In Section 2 we present our selection rule and the theoretical properties of the resulting estimator. For illustration, we show in Sections 3, 4 and 5 respectively, how the procedure can be used to aggregate preliminary estimators, select a linear estimator among a finite collection of candidate ones, or solve the problem of variable selection. Section 6 is devoted to two simulation studies. One aims at comparing the performance of our procedure to the classical V-fold in view of selecting a tuning parameter among a grid. In the other, we compare the performance of the variable selection procedure we propose with that of some classical ones such as the Lasso, random forest, and others based on ridge and PLS regression. Finally, the proofs are postponed to Section 7.

Throughout the paper C denotes a constant that may vary from line to line. 

2. The procedure and the main result 

2.1. The procedure 

Given a collection $\mathbb{F} = \{\hat f_\lambda, \lambda \in \Lambda\}$ of estimators of $f$ based on $Y$, the selection rule we propose is based on the choices of a family $\mathbb{S}$ of linear subspaces of $\mathbb{R}^n$, a collection $\{\mathbb{S}_\lambda, \lambda \in \Lambda\}$ of (possibly random) subsets of $\mathbb{S}$, a weight function $\Delta$ and a penalty function pen, both from $\mathbb{S}$ into $\mathbb{R}_+$. We introduce those objects below and refer to Sections 3, 4 and 5 for examples.

2.1.1. The collection of estimators $\mathbb{F}$

The collection $\mathbb{F} = \{\hat f_\lambda, \lambda \in \Lambda\}$ can be arbitrary. In particular, $\mathbb{F}$ need not be finite nor countable and it may consist of a mix of estimators based on the minimization of a criterion, a Bayes procedure or the guess of some experts. The dependency of these estimators with respect to $Y$ need not be known. Nevertheless, we shall see on examples how we can use this information, when available, to improve the performance of our estimation procedure.

2.1.2. The families $\mathbb{S}$ and $\mathbb{S}_\lambda$

Let $\mathbb{S}$ be a family of linear spaces of $\mathbb{R}^n$ satisfying the following.

Assumption 1. The family $\mathbb{S}$ is finite or countable and for all $S \in \mathbb{S}$, $\dim(S) \le n - 2$.

To each estimator $\hat f_\lambda \in \mathbb{F}$, we associate a (possibly random) subset $\mathbb{S}_\lambda \subset \mathbb{S}$.

Typically, the family $\mathbb{S}$ should be chosen to possess good approximation properties with respect to the elements of $\mathbb{F}$ and $\mathbb{S}_\lambda$ with respect to $\hat f_\lambda$ specifically. One may take $\mathbb{S}_\lambda = \mathbb{S}$ but for computational reasons it will be convenient to allow $\mathbb{S}_\lambda$ to be smaller. The choices of $\mathbb{S}_\lambda$ may be made on the basis of the observation $\hat f_\lambda$. We provide examples of $\mathbb{S}$ and $\mathbb{S}_\lambda$ in various statistical settings described in Sections 3 to 5.

2.1.3. The weight function $\Delta$ and the associated function $\mathrm{pen}_\Delta$

We consider a function $\Delta$ from $\mathbb{S}$ into $\mathbb{R}_+$ and assume

Assumption 2.

$$\Sigma = \sum_{S \in \mathbb{S}} e^{-\Delta(S)} < +\infty. \tag{2}$$

Whenever $\mathbb{S}$ is finite, inequality (2) automatically holds true. However, in practice $\Sigma$ should be kept to a reasonable size. When $\Sigma = 1$, $e^{-\Delta(\cdot)}$ can be interpreted as a prior distribution on $\mathbb{S}$ and thus gives a Bayesian flavor to the procedure we propose. To the weight function $\Delta$, we associate the function $\mathrm{pen}_\Delta$ mapping $\mathbb{S}$ into $\mathbb{R}_+$ and defined by

$$\mathbb{E}\left[\left(U - \frac{\mathrm{pen}_\Delta(S)}{n - \dim(S)}\, V\right)_+\right] = e^{-\Delta(S)} \tag{3}$$

where $x_+$ denotes the positive part of $x \in \mathbb{R}$ and $U$, $V$ are two independent $\chi^2$ random variables with respectively $\dim(S) + 1$ and $n - \dim(S) - 1$ degrees

imsart-generic ver. 2010/04/27 file: LinSelect-12-04-2011.tex date: June 23, 2011 



Y. Baraud et al/Estimator selection 



of freedom. This function can be easily computed from the quantiles of the Fisher distribution as we shall see in Section 8.1. From a more theoretical point of view, it is shown in Baraud et al (2009) that under Assumption 3 below, there exists a positive constant $C$ (depending on $\kappa$ only) such that

$$\mathrm{pen}_\Delta(S) \le C\,(\dim(S) \vee \Delta(S)). \tag{4}$$

Assumption 3. There exists $\kappa \in (0, 1)$ such that for all $S \in \mathbb{S}$,

$$1 \le \dim(S) \vee \Delta(S) \le \kappa n.$$
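
As a rough numerical illustration of definition (3), the sketch below solves $\mathbb{E}[(U - xV/(n - \dim(S)))_+] = e^{-\Delta(S)}$ for $x$ by Monte Carlo and bisection. This is an assumption-laden substitute for the Fisher-quantile computation mentioned in the text, which is not reproduced here; the function name `pen_delta`, the number of draws and the seed are hypothetical choices, and the result carries Monte Carlo error.

```python
import numpy as np

def pen_delta(dim, delta, n, n_draws=100_000, seed=0):
    """Monte Carlo approximation of the solution x of
    E[(U - x * V / (n - dim))_+] = exp(-delta),
    with U ~ chi2(dim + 1) and V ~ chi2(n - dim - 1) independent."""
    rng = np.random.default_rng(seed)
    u = rng.chisquare(dim + 1, n_draws)
    v = rng.chisquare(n - dim - 1, n_draws)
    target = np.exp(-delta)

    def g(x):
        # Empirical version of the left-hand side; non-increasing in x.
        return np.mean(np.maximum(u - x * v / (n - dim), 0.0))

    lo, hi = 0.0, 1.0
    while g(hi) > target:       # enlarge the bracket until it contains the root
        hi *= 2.0
    for _ in range(60):         # plain bisection
        mid = 0.5 * (lo + hi)
        if g(mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For fixed `dim` and `n`, a larger weight $\Delta(S)$ lowers the target $e^{-\Delta(S)}$ and therefore yields a larger penalty, in line with the bound (4).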



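2.1.4. The selection criterion

The selection procedure we propose involves a penalty function pen from $\mathbb{S}$ into $\mathbb{R}_+$ with the following property.

Assumption 4. The penalty function pen satisfies for some $K > 1$,

$$\mathrm{pen}(S) \ge K\,\mathrm{pen}_\Delta(S) \quad \text{for all } S \in \mathbb{S}. \tag{5}$$

Whenever equality holds in (5), it derives from (4) that $\mathrm{pen}(S)$ measures the complexity of the model $S$ in terms of dimension and weight.

Denoting by $\Pi_S$ the projection operator onto a linear space $S \subset \mathbb{R}^n$, given the families $\mathbb{S}_\lambda$, the penalty function pen and some positive number $\alpha$, we define

$$\mathrm{crit}_\alpha(\hat f_\lambda) = \inf_{S \in \mathbb{S}_\lambda} \Big[ \|Y - \Pi_S \hat f_\lambda\|^2 + \alpha \|\hat f_\lambda - \Pi_S \hat f_\lambda\|^2 + \mathrm{pen}(S)\, \hat\sigma_S^2 \Big] \tag{6}$$

where

$$\hat\sigma_S^2 = \frac{\|Y - \Pi_S Y\|^2}{n - \dim(S)}. \tag{7}$$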
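
The criterion can be evaluated directly once each space $S$ is described by a basis. The following sketch assumes that the criterion is $\|Y - \Pi_S \hat f_\lambda\|^2 + \alpha\|\hat f_\lambda - \Pi_S \hat f_\lambda\|^2 + \mathrm{pen}(S)\hat\sigma_S^2$ minimised over a finite list of spaces, that every basis has full column rank, and that the penalty values are supplied by the user; the names `project` and `crit_alpha` are hypothetical.

```python
import numpy as np

def project(basis, v):
    """Orthogonal projection of v onto the column span of `basis` (n x d, full rank)."""
    coef, *_ = np.linalg.lstsq(basis, v, rcond=None)
    return basis @ coef

def crit_alpha(y, f_hat, bases, pens, alpha=0.5):
    """Minimum over the listed spaces of
    ||y - P f_hat||^2 + alpha * ||f_hat - P f_hat||^2 + pen * sigma2_S,
    where P projects onto the space and sigma2_S = ||y - P y||^2 / (n - dim)."""
    n = y.size
    values = []
    for basis, pen in zip(bases, pens):
        dim = basis.shape[1]
        sigma2_S = np.sum((y - project(basis, y)) ** 2) / (n - dim)
        pf = project(basis, f_hat)
        values.append(np.sum((y - pf) ** 2)
                      + alpha * np.sum((f_hat - pf) ** 2)
                      + pen * sigma2_S)
    return min(values)
```

Selecting an estimator then amounts to calling `crit_alpha` once per candidate $\hat f_\lambda$ and keeping a minimiser, so the cost grows linearly with the number of candidates when the lists of spaces are short.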



2.2. The main result 

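For all $\lambda \in \Lambda$ let us set

$$A(\hat f_\lambda, \mathbb{S}_\lambda) = \inf_{S \in \mathbb{S}_\lambda} \Big[ \|\hat f_\lambda - \Pi_S \hat f_\lambda\|^2 + \mathrm{pen}(S)\, \hat\sigma_S^2 \Big]. \tag{8}$$

This quantity corresponds to an accuracy index for the estimator $\hat f_\lambda$ with respect to the family $\mathbb{S}_\lambda$. The following result holds.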
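Theorem 1. Let $K > 1$, $\alpha > 0$, $\delta \ge 0$. Assume that Assumptions 1, 2 and 4 hold. There exists a constant $C$ (given by (33)) depending on $K$ and $\alpha$ only such that for any $\hat f_{\hat\lambda}$ in $\mathbb{F}$ satisfying

$$\mathrm{crit}_\alpha(\hat f_{\hat\lambda}) \le \inf_{\lambda \in \Lambda} \mathrm{crit}_\alpha(\hat f_\lambda) + \delta, \tag{9}$$

we have the following bounds

$$C\, \mathbb{E}\big[\|f - \hat f_{\hat\lambda}\|^2\big] \le \mathbb{E}\Big[\inf_{\lambda \in \Lambda} \big\{ \|f - \hat f_\lambda\|^2 + A(\hat f_\lambda, \mathbb{S}_\lambda) \big\}\Big] + \Sigma\sigma^2 + \delta \tag{10}$$

$$\le \inf_{\lambda \in \Lambda} \Big\{ \mathbb{E}\big[\|f - \hat f_\lambda\|^2\big] + \mathbb{E}\big[A(\hat f_\lambda, \mathbb{S}_\lambda)\big] \Big\} + \Sigma\sigma^2 + \delta \tag{11}$$

(provided that the quantity involved in the expectation in (10) is measurable). Furthermore, if equality holds in (5) and Assumption 3 is satisfied, for each $\lambda \in \Lambda$

• if the set $\mathbb{S}_\lambda$ is non-random,

$$C'\, \mathbb{E}\big[A(\hat f_\lambda, \mathbb{S}_\lambda)\big] \le \mathbb{E}\big[\|f - \hat f_\lambda\|^2\big] + \inf_{S \in \mathbb{S}_\lambda} \Big[ \mathbb{E}\big[\|\hat f_\lambda - \Pi_S \hat f_\lambda\|^2\big] + (\dim(S) \vee \Delta(S))\,\sigma^2 \Big] \tag{12}$$

• if there exists a (possibly random) linear space $\hat S_\lambda \in \mathbb{S}_\lambda$ such that $\hat f_\lambda \in \hat S_\lambda$ with probability 1,

$$C'\, \mathbb{E}\big[A(\hat f_\lambda, \mathbb{S}_\lambda)\big] \le \mathbb{E}\big[\|f - \hat f_\lambda\|^2\big] + \mathbb{E}\big[\dim(\hat S_\lambda) \vee \Delta(\hat S_\lambda)\big]\,\sigma^2 \tag{13}$$

where $C'$ is a positive constant only depending on $\kappa$ and $K$.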

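Let us now comment on Theorem 1.

It turns out that inequality (10) leaves no room for a substantial improvement in the sense that the bound we get is essentially optimal and cannot be improved (apart from constants) by any other selection rule among $\mathbb{F}$. To see this, let us assume for simplicity that $\mathbb{F}$ is finite so that a measurable minimizer of $\mathrm{crit}_\alpha$ always exists and $\delta$ can be chosen as 0. Let $K = 1.1$, $\alpha = 1/2$ (to fix ideas), $\mathbb{S}$ a family of linear spaces satisfying the assumptions of Theorem 1 and pen the penalty function achieving equality in (5). Besides, assume that $\mathbb{S}$ contains a linear space $\bar S$ such that $1 \le \dim(\bar S) \le n/2$ and associate to $\bar S$ the weight $\Delta(\bar S) = \dim(\bar S)$. If $\mathbb{S}_\lambda = \mathbb{S}$ for all $\lambda$, we deduce from (4) and (10) that for some universal constant $C'$, whatever $\mathbb{F}$ and $f \in \mathbb{R}^n$,

$$C'\, \mathbb{E}\big[\|f - \hat f_{\hat\lambda}\|^2\big] \le \mathbb{E}\Big[\inf_{\lambda \in \Lambda} \Big\{ \|f - \hat f_\lambda\|^2 + \inf_{S \in \mathbb{S}} \big[ \|\hat f_\lambda - \Pi_S \hat f_\lambda\|^2 + \mathrm{pen}(S)\,\hat\sigma_S^2 \big] \Big\}\Big]$$

$$\le \mathbb{E}\Big[\inf_{\lambda \in \Lambda} \Big\{ \|f - \hat f_\lambda\|^2 + \|\hat f_\lambda - \Pi_{\bar S} \hat f_\lambda\|^2 + \dim(\bar S)\,\hat\sigma_{\bar S}^2 \Big\}\Big]. \tag{14}$$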

In the opposite direction, the following result holds. 

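Proposition 1. There exists a universal constant $C$ such that for any finite family $\mathbb{F} = \{\hat f_\lambda, \lambda \in \Lambda\}$ of estimators and any selection rule $\hat\lambda$ based on $Y$ among $\Lambda$, there exists $f \in \bar S$ such that

$$C\, \mathbb{E}\big[\|f - \hat f_{\hat\lambda}\|^2\big] \ge \mathbb{E}\Big[\inf_{\lambda \in \Lambda} \Big\{ \|f - \hat f_\lambda\|^2 + \|\hat f_\lambda - \Pi_{\bar S} \hat f_\lambda\|^2 \Big\}\Big] + \dim(\bar S)\,\sigma^2. \tag{15}$$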



We see that, up to the estimator $\hat\sigma_{\bar S}^2$ in place of $\sigma^2$ and numerical constants,
the right-hand sides of (14) and (15) coincide.

In view of commenting (11) further, we continue assuming that $\mathbb{F}$ is finite so that we can keep $\delta = 0$ in (11). A particular feature of (11) lies in the fact that the risk bound pays no price for considering a large collection $\mathbb{F}$ of estimators. In fact, it is actually decreasing with respect to $\mathbb{F}$ (or equivalently $\Lambda$) for the inclusion. This means that if one adds a new estimator to the collection $\mathbb{F}$ (without changing either $\mathbb{S}$ or the families $\mathbb{S}_\lambda$ associated to the former estimators), the risk bound for $\hat f_{\hat\lambda}$ can only be improved. In contrast, the computation of the estimator $\hat f_{\hat\lambda}$ becomes more demanding as $|\mathbb{F}|$ grows. More precisely, if the cardinalities of the families $\mathbb{S}_\lambda$ are not too large, the computation of $\hat f_{\hat\lambda}$ requires around $|\mathbb{F}|$ steps.

The selection rule we use does not require knowing how the estimators depend on $Y$. In fact, as we shall see, a more important piece of information is the ranges of the estimators $\hat f_\lambda = \hat f_\lambda(Y)$ as $Y$ varies in $\mathbb{R}^n$. A situation of special interest occurs when each $\hat f_\lambda$ belongs to some (possibly random) linear space $\hat S_\lambda$ in $\mathbb{S}$ with probability one. By taking $\mathbb{S}_\lambda$ such that $\hat S_\lambda \in \mathbb{S}_\lambda$ for all $\lambda$, we deduce from Theorem 1 by using (11) and (13) the following corollary.

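Corollary 1. Assume that the Assumptions of Theorem 1 are satisfied, that Assumption 3 holds and that equality holds in (5). If for all $\lambda \in \Lambda$ there exists a (possibly random) linear space $\hat S_\lambda \in \mathbb{S}_\lambda$ such that $\hat f_\lambda \in \hat S_\lambda$ with probability 1, then $\hat f_{\hat\lambda}$ satisfies

$$C\, \mathbb{E}\big[\|f - \hat f_{\hat\lambda}\|^2\big] \le \inf_{\lambda \in \Lambda} \Big\{ \mathbb{E}\big[\|f - \hat f_\lambda\|^2\big] + \mathbb{E}\big[\dim(\hat S_\lambda) \vee \Delta(\hat S_\lambda)\big]\,\sigma^2 \Big\} + \delta, \tag{16}$$

for some $C$ depending on $K$ and $\kappa$ only.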

One may apply this result in the context of model selection. One starts with a collection of models $\mathbb{S} = \{S_m, m \in \mathcal{M}\}$ and associates to each $S_m$ an estimator $\hat f_m$ with values in $S_m$. By taking $\mathbb{F} = \{\hat f_m, m \in \mathcal{M}\}$ (here $\Lambda = \mathcal{M}$) and $\mathbb{S}_m = \{S_m\}$ for all $m \in \mathcal{M}$, our selection procedure leads to an estimator $\hat f_{\hat m}$ which satisfies

$$C\, \mathbb{E}\big[\|f - \hat f_{\hat m}\|^2\big] \le \inf_{m \in \mathcal{M}} \Big\{ \mathbb{E}\big[\|f - \hat f_m\|^2\big] + (\dim(S_m) \vee \Delta(S_m))\,\sigma^2 \Big\}. \tag{17}$$

When $\hat f_m = \Pi_{S_m} Y$ for all $m \in \mathcal{M}$, our selection rule becomes

$$\hat m = \arg\min_{m \in \mathcal{M}} \Big\{ \|Y - \hat f_m\|^2 + \mathrm{pen}(S_m)\,\hat\sigma_{S_m}^2 \Big\} \tag{18}$$

and turns out to coincide with that described in Baraud et al (2009). Interestingly, Corollary 1 shows that this selection rule can still be used for families $\mathbb{F}$ of (non-linear) estimators of the form $\Pi_{S_{\hat m}} Y$ where the $S_{\hat m}$ are chosen randomly among $\mathbb{S}$ on the basis of $Y$, doing thus as if the linear spaces $S_{\hat m}$ were non-random. An estimator of the form $\Pi_{S_{\hat m}} Y$ can be interpreted as resulting from a model selection procedure among the family of projection estimators $\{\Pi_{S_m} Y, m \in \mathcal{M}\}$ and hence, (18) can be used to choose some best model selection rule among a collection of candidate ones.
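
For projection estimators, the rule reduces to minimising a penalised residual sum of squares, which the following sketch illustrates on nested models. It assumes the rule has the form $\|Y - \Pi_{S_m}Y\|^2 + \mathrm{pen}(S_m)\hat\sigma^2_{S_m}$ with $\hat\sigma^2_{S_m} = \|Y - \Pi_{S_m}Y\|^2/(n - \dim(S_m))$. The cosine design, the signal and in particular the penalty values `2.5 * d` are hypothetical placeholders, not the penalty $\mathrm{pen}_\Delta$ of the paper.

```python
import numpy as np

def select_model(y, bases, pens):
    """Return (index minimising the penalised criterion, list of criterion values)."""
    n = y.size
    scores = []
    for basis, pen in zip(bases, pens):
        coef, *_ = np.linalg.lstsq(basis, y, rcond=None)
        rss = np.sum((y - basis @ coef) ** 2)      # ||Y - Pi_{S_m} Y||^2
        sigma2_hat = rss / (n - basis.shape[1])    # variance estimator attached to S_m
        scores.append(rss + pen * sigma2_hat)
    return int(np.argmin(scores)), scores

# Hypothetical example: the signal lives in the second of eight nested models.
rng = np.random.default_rng(2)
n = 100
x = (np.arange(n) + 0.5) / n
design = np.column_stack([np.cos(np.pi * k * x) for k in range(8)])
f = 1.0 * design[:, 0] + 5.0 * design[:, 1]
y = f + rng.standard_normal(n)

bases = [design[:, :d] for d in range(1, 9)]   # nested models of dimension 1..8
pens = [2.5 * d for d in range(1, 9)]          # placeholder penalty, NOT pen_Delta
m_hat, scores = select_model(y, bases, pens)
```

Because the variance is estimated separately within each model, the rule needs no prior knowledge of $\sigma^2$.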

3. Aggregation 

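In this section, we consider the problems of Model Selection Aggregation (MS), Convex Aggregation (Cv) and Linear Aggregation (L) defined below. Given $M \ge 2$ preliminary estimators of $f$, denoted $\{\phi_k, k = 1, \dots, M\}$, our aim is to build an estimator $\hat f$ based on $Y$ whose risk is as close as possible to $\inf_{g \in \mathbb{F}_\Lambda} \|f - g\|^2$ where

$$\mathbb{F}_\Lambda = \Big\{ f_\lambda = \sum_{j=1}^M \lambda_j \phi_j, \ \lambda \in \Lambda \Big\}$$

and, according to the aggregation problem at hand, $\Lambda$ is one of the three sets

$$\Lambda_{\mathrm{MS}} = \Big\{ \lambda \in \{0,1\}^M, \ \sum_{j=1}^M \lambda_j = 1 \Big\}, \quad \Lambda_{\mathrm{Cv}} = \Big\{ \lambda \in \mathbb{R}_+^M, \ \sum_{j=1}^M \lambda_j = 1 \Big\}, \quad \Lambda_{\mathrm{L}} = \mathbb{R}^M.$$

When $\Lambda = \Lambda_{\mathrm{MS}}$, $\mathbb{F}_\Lambda$ is the set $\{\phi_1, \dots, \phi_M\}$ consisting of the initial estimators. When $\Lambda = \Lambda_{\mathrm{Cv}}$, $\mathbb{F}_\Lambda$ is the convex hull of the $\phi_j$. In the literature, one may also find

$$\Big\{ \lambda \in [0,1]^M, \ \sum_{j=1}^M \lambda_j \le 1 \Big\}$$

in place of $\Lambda_{\mathrm{Cv}}$ in which case $\mathbb{F}_\Lambda$ is the convex hull of $\{0, \phi_1, \dots, \phi_M\}$. Finally, when $\Lambda = \Lambda_{\mathrm{L}}$, $\mathbb{F}_\Lambda$ is the linear span of the $\phi_j$.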

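Each of these three aggregation problems is solved separately if for each $\Lambda \in \{\Lambda_{\mathrm{MS}}, \Lambda_{\mathrm{Cv}}, \Lambda_{\mathrm{L}}\}$ one can design an estimator $\hat f = \hat f(\Lambda)$ satisfying

$$\mathbb{E}\big[\|f - \hat f\|^2\big] - C \inf_{g \in \mathbb{F}_\Lambda} \|f - g\|^2 \le C'\, \psi_{n,\Lambda}\, \sigma^2 \tag{19}$$

with $C = 1$, $C' > 0$ free of $f$, $n$, $M$ and

$$\psi_{n,\Lambda} = \begin{cases} M & \text{if } \Lambda = \Lambda_{\mathrm{L}} \\ \sqrt{n \log(eM/\sqrt{n})} & \text{if } \Lambda = \Lambda_{\mathrm{Cv}} \text{ and } \sqrt{n} \le M \\ M & \text{if } \Lambda = \Lambda_{\mathrm{Cv}} \text{ and } \sqrt{n} \ge M \\ \log M & \text{if } \Lambda = \Lambda_{\mathrm{MS}}. \end{cases} \tag{20}$$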
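
For concreteness, the rates can be tabulated with a few lines of code. The sketch assumes the four cases read: $M$ for linear aggregation, $\sqrt{n\log(eM/\sqrt{n})}$ for convex aggregation when $\sqrt{n} \le M$ and $M$ otherwise, and $\log M$ for model selection aggregation; the function name `psi` and the string labels are hypothetical.

```python
import math

def psi(n, M, problem):
    """Aggregation rate for problem in {"L", "Cv", "MS"} (variance factor omitted)."""
    if problem == "L":
        return float(M)
    if problem == "MS":
        return math.log(M)
    if problem == "Cv":
        root_n = math.sqrt(n)
        if root_n <= M:
            return math.sqrt(n * math.log(math.e * M / root_n))
        return float(M)
    raise ValueError("problem must be 'L', 'Cv' or 'MS'")

# Hypothetical sizes: n observations, M preliminary estimators.
rates = {name: psi(100, 50, name) for name in ("MS", "Cv", "L")}
```

The two convex-aggregation branches agree at $\sqrt{n} = M$, and for these sizes the rates are ordered as MS, then Cv, then L, which mirrors the trade-off between bias and rate as $\Lambda$ grows.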



These problems have only been considered when the variance is known. The quantity $\psi_{n,\Lambda}$ then corresponds to the best possible upper bound in (19) over all possible $f \in \mathbb{R}^n$ and preliminary estimators $\phi_j$ and is called the optimal rate of aggregation. For a more precise definition, we refer the reader to Tsybakov (2003). Bunea et al (2007) considered the problem of solving these three problems simultaneously by building an estimator $\hat f$ which satisfies (19) simultaneously for all $\Lambda \in \{\Lambda_{\mathrm{MS}}, \Lambda_{\mathrm{Cv}}, \Lambda_{\mathrm{L}}\}$ and some constant $C > 1$. This is an interesting issue since it is impossible to know in practice which aggregation device should be used to achieve the smallest risk bound: as $\Lambda$ grows (for the inclusion), the bias $\inf_{g \in \mathbb{F}_\Lambda} \|f - g\|$ decreases while the rate $\psi_{n,\Lambda}$ increases.

The aim of this section is to show that our procedure provides a way of 
solving (or nearly solving) the three aggregation problems both separately 
and simultaneously when the variance is unknown. 

Throughout this section, we consider the family $\mathcal{S}$ consisting of the $S_m$ defined for each $m \subset \{1, \dots, M\}$ and $m \ne \emptyset$ as the linear span of the $\phi_j$ for $j \in m$. Along this section, we shall use the weight function $\Delta$ defined on $\mathcal{S}$ by

$$\Delta(S_m) = |m| + \log \binom{M}{|m|},$$

take $\alpha = 1/2$ and $\mathrm{pen}(\cdot) = 1.1\,\mathrm{pen}_\Delta(\cdot)$, taking thus $K = 1.1$. The choices of $\alpha$ and $K$ are only meant to fix ideas. Note that $\Delta$ satisfies Assumption 2 with $\Sigma \le 1$. To avoid trivialities, we assume all along $n \ge 4$.
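
The claim that this weight function satisfies Assumption 2 with a sum at most 1 can be checked numerically. The sketch assumes the weight of $S_m$ is $|m| + \log\binom{M}{|m|}$, so that the $\binom{M}{d}$ subsets of size $d$ contribute $e^{-d}$ in total; the function names are hypothetical.

```python
import math

def delta(M, size):
    """Weight |m| + log C(M, |m|) attached to a subset m of {1, ..., M} of given size."""
    return size + math.log(math.comb(M, size))

def sigma_sum(M):
    """Sum of exp(-Delta(S_m)) over all non-empty subsets m, grouped by size."""
    return sum(math.comb(M, d) * math.exp(-delta(M, d)) for d in range(1, M + 1))

total = sigma_sum(20)   # hypothetical number of preliminary estimators
```

Grouping by size gives $\sum_{d=1}^{M} e^{-d} < 1/(e-1) < 1$ whatever $M$, and for singletons the weight equals $1 + \log M = \log(eM)$.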



3.1. Solving the three aggregation problems separately 

3.1.1. Linear Aggregation 

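Problem (L) is the easiest to solve. Let us take $\mathbb{F} = \mathbb{F}_\Lambda$ with $\Lambda = \Lambda_{\mathrm{L}}$,

$$\mathbb{S}_{\mathrm{L}} = \{S_{\{1,\dots,M\}}\} \tag{21}$$

and $\mathbb{S}_\lambda = \mathbb{S}_{\mathrm{L}}$ for all $\lambda \in \Lambda_{\mathrm{L}}$. Minimizing $\mathrm{crit}_\alpha(f_\lambda)$ over $f_\lambda \in \mathbb{F}_\Lambda$ amounts to minimizing $\|Y - f_\lambda\|$ over $f_\lambda \in S_{\{1,\dots,M\}}$ and hence, the resulting estimator is merely $\hat f_{\mathrm{L}} = \Pi_{S_{\{1,\dots,M\}}} Y$. The risk of $\hat f_{\mathrm{L}}$ satisfies

$$\mathbb{E}\big[\|f - \hat f_{\mathrm{L}}\|^2\big] \le \inf_{g \in \mathbb{F}_\Lambda} \|f - g\|^2 + M\sigma^2,$$

whatever $n$ and $M$, which solves the problem of Linear Aggregation.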
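
In computational terms, the linear aggregate is an ordinary least-squares projection of $Y$ onto the span of the preliminary estimators. The sketch below treats the $\phi_j$ as fixed vectors stored as the columns of a matrix, which is a simplifying assumption; the simulated data and the name `linear_aggregate` are hypothetical.

```python
import numpy as np

def linear_aggregate(y, phis):
    """Project y onto the linear span of the columns of `phis` (n x M).
    Returns the fitted vector and one vector of aggregation weights."""
    lam, *_ = np.linalg.lstsq(phis, y, rcond=None)
    return phis @ lam, lam

# Hypothetical preliminary estimators and observations.
rng = np.random.default_rng(3)
n, M = 60, 4
phis = rng.standard_normal((n, M))
y = phis @ np.array([0.5, 0.0, -1.0, 2.0]) + rng.standard_normal(n)
f_L, lam_hat = linear_aggregate(y, phis)
```

By construction the residual `y - f_L` is orthogonal to every $\phi_j$, so the aggregate fits the data at least as well as any single preliminary estimator.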
3.1.2. Model Selection Aggregation

To tackle Problem (MS), we take $\mathbb{F} = \mathbb{F}_\Lambda$ with $\Lambda = \Lambda_{\mathrm{MS}}$, that is, $\mathbb{F}_\Lambda = \{\phi_1, \dots, \phi_M\}$,

$$\mathbb{S} = \mathbb{S}_{\mathrm{MS}} = \{S_{\{1\}}, \dots, S_{\{M\}}\} \tag{22}$$

and associate to each $f_\lambda = \phi_j$ the collection $\mathbb{S}_\lambda$ reduced to $\{S_{\{j\}}\}$. Note that $\dim(S) \le 1$ and $\Delta(S) = \log(eM) \ge \dim(S)$ for all $S \in \mathbb{S}_{\mathrm{MS}}$, so that under the assumption that $\log(eM) \le n/2$ we may apply Corollary 1 with $\delta = 0$ (since $\mathbb{F}_\Lambda$ is finite), $\kappa = 1/2$ and get that for some constant $C > 0$ the resulting estimator $\hat f_{\mathrm{MS}}$ satisfies

$$C\, \mathbb{E}\big[\|f - \hat f_{\mathrm{MS}}\|^2\big] \le \inf_{g \in \mathbb{F}_\Lambda} \|f - g\|^2 + \log(M)\,\sigma^2.$$

This risk bound is of the form (19) except for the constant $C$ which is not equal to 1. We do not know whether Problem (MS) can be solved or not with $C = 1$ when the variance $\sigma^2$ is unknown and $M$ is large (possibly larger than $n$).

3.1.3. Convex aggregation 



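For this problem, we emphasize the aggregation rate with respect to the quantity

$$L = \sup_{j=1,\dots,M} \frac{\|\phi_j\|}{\sigma\sqrt{n}}. \tag{23}$$

If $M \le \sqrt{n}L$, take again the estimator $\hat f_{\mathrm{L}}$. Since the convex hull of the $\phi_j$ is a subset of the linear space $S_{\{1,\dots,M\}}$, for $\Lambda = \Lambda_{\mathrm{Cv}}$ we have

$$\mathbb{E}\big[\|f - \hat f_{\mathrm{L}}\|^2\big] \le \inf_{g \in \mathbb{F}_\Lambda} \|f - g\|^2 + M\sigma^2.$$

Let us now turn to the case $M \ge \sqrt{n}L$. More precisely, assume that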



2 < ^/nL < M < e~^ min (y/nLi 



,nL2^gV^/(2L)| 



(24) 



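and set $d(n, M) = n/(2\log(eM))$. We consider the family of estimators $\mathbb{F} = \mathbb{F}_\Lambda$ with $\Lambda = \Lambda_{\mathrm{Cv}}$ and

$$\mathbb{S} = \mathbb{S}_{\mathrm{Cv}} = \mathbb{S}_\lambda = \{S_m \in \mathcal{S}, \ |m| \le d(n, M)\}, \quad \forall \lambda \in \Lambda_{\mathrm{Cv}}. \tag{25}$$

The set $\Lambda_{\mathrm{Cv}}$ being compact, $\lambda \mapsto \mathrm{crit}_\alpha(f_\lambda)$ admits a minimum $\hat\lambda$ over $\Lambda_{\mathrm{Cv}}$ and we set $\hat f_{\mathrm{Cv}} = f_{\hat\lambda}$.

Proposition 2. There exists a universal constant $C > 1$ such that

$$\mathbb{E}\big[\|f - \hat f_{\mathrm{Cv}}\|^2\big] - C \inf_{g \in \mathbb{F}_\Lambda} \|f - g\|^2 \le C \sqrt{nL^2 \log\big(eM/\sqrt{nL^2}\big)}\,\sigma^2.$$

This risk bound is of the form (19) except for the constant $C$ which is not equal to 1. Again, we do not know whether Problem (Cv) can be solved or not with $C = 1$ when the variance $\sigma^2$ is unknown and $M$ is possibly larger than $n$.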

3.2. Solving the three problems simultaneously 



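Consider now three estimators $\hat f_{\mathrm{L}}$, $\hat f_{\mathrm{MS}}$, $\hat f_{\mathrm{Cv}}$ with values respectively in $S_{\{1,\dots,M\}}$, $\bigcup_{j=1}^M S_{\{j\}}$ and the convex hull $\mathcal{C}$ of the $\phi_j$ (we use a new notation for this convex hull to avoid ambiguity). One may take the estimators defined in Section 3.1 but any others would suit. The aim of this section is to select the one with the smallest risk to estimate $f$. To do so, we apply our selection procedure with $\mathbb{F} = \{\hat f_{\mathrm{L}}, \hat f_{\mathrm{MS}}, \hat f_{\mathrm{Cv}}\}$, taking thus $\Lambda = \{\mathrm{L}, \mathrm{MS}, \mathrm{Cv}\}$, and associate to each of these three estimators the families $\mathbb{S}_{\mathrm{L}}$, $\mathbb{S}_{\mathrm{MS}}$, $\mathbb{S}_{\mathrm{Cv}}$ defined by (21), (22) and (25) respectively and choose $\mathbb{S} = \mathbb{S}_{\mathrm{L}} \cup \mathbb{S}_{\mathrm{MS}} \cup \mathbb{S}_{\mathrm{Cv}}$.

Proposition 3. Assume that (24) holds and that $\log(eM) \le n/2$. There exists a universal constant $C > 0$ such that whatever $\hat f_{\mathrm{L}}$, $\hat f_{\mathrm{MS}}$ and $\hat f_{\mathrm{Cv}}$ with values in $S_{\{1,\dots,M\}}$, $\bigcup_{j=1}^M S_{\{j\}}$ and $\mathcal{C}$ respectively, the selected estimator $\hat f_{\hat\lambda}$ satisfies for all $f \in \mathbb{R}^n$,

$$C\, \mathbb{E}\big[\|f - \hat f_{\hat\lambda}\|^2\big] \le \inf_{\lambda \in \{\mathrm{L}, \mathrm{MS}, \mathrm{Cv}\}} \Big\{ \mathbb{E}\big[\|f - \hat f_\lambda\|^2\big] + B_\lambda \Big\}$$

where

$$B_{\mathrm{L}} = \sigma^2 M, \quad B_{\mathrm{MS}} = \sigma^2 \log M, \quad B_{\mathrm{Cv}} = \sigma^2 \Big( M \wedge \sqrt{nL^2 \log\big(eM/\sqrt{nL^2}\big)} \Big).$$

In particular, if $\hat f_{\mathrm{L}}$, $\hat f_{\mathrm{MS}}$ and $\hat f_{\mathrm{Cv}}$ fulfill (19), then

$$C\, \mathbb{E}\big[\|f - \hat f_{\hat\lambda}\|^2\big] \le \inf_{\lambda \in \{\mathrm{L}, \mathrm{MS}, \mathrm{Cv}\}} \Big\{ \inf_{g \in \mathbb{F}_\lambda} \|f - g\|^2 + B_\lambda \Big\}$$

where $\mathbb{F}_\lambda$ stands for $\mathbb{F}_\Lambda$ when $\Lambda = \Lambda_\lambda$.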
4. Selecting among linear estimators

In this section, we consider the situation where the estimators $\hat f_\lambda$ are linear, that is, are of the form $\hat f_\lambda = A_\lambda Y$ for some known and deterministic $n \times n$ matrix $A_\lambda$. As mentioned before, this setting covers many popular estimation procedures including kernel ridge estimators, spline smoothing, Nadaraya estimators, $k$-nearest neighbors, projection estimators, low-pass filters, etc. In some cases $A_\lambda$ is symmetric (e.g. kernel ridge, spline smoothing, projection estimators), in some others $A_\lambda$ is non-symmetric and non-singular (as for Nadaraya estimators) and sometimes $A_\lambda$ can be both singular and non-symmetric (low-pass filters, $k$-nearest neighbors). A common feature of those procedures lies in the fact that they depend on a tuning parameter (possibly multidimensional) and their practical performance can be quite poor if this parameter is not suitably calibrated. A series of papers have investigated the calibration of some of these procedures. To mention a few of them, Cao and Golubev (2006) focus on spline smoothing, Zhang (2005) on kernel ridge regression, Goldenshluger and Lepski (2009) on kernel estimators and Arlot and Bach (2009) propose a procedure to select among symmetric linear estimators with spectrum in $[0, 1]$. The procedure we present can handle all these cases in a unified framework. Throughout the section, we assume that $\Lambda$ is finite.

4.1. The families $\mathbb{S}_\lambda$

To apply our selection procedure, we need to associate to each $A_\lambda$ a suitable collection of approximation spaces $\mathbb{S}_\lambda$. To do so, we introduce below a linear space $S_\lambda$ which plays a key role in our analysis.

For the sake of simplicity, let us first consider the case where $A_\lambda$ is non-singular. Then $S_\lambda$ is defined as the linear span of the right-singular vectors of $A_\lambda^{-1} - I$ associated to singular values smaller than 1. When $A_\lambda$ is symmetric, $S_\lambda$ is merely the linear span of the eigenvectors of $A_\lambda$ associated to eigenvalues not smaller than 1/2. If none of the singular values are smaller than 1, then $S_\lambda = \{0\}$.
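
For a non-singular matrix, the space $S_\lambda$ can be read off a singular value decomposition. The sketch below assumes the definition is the span of the right-singular vectors of $A_\lambda^{-1} - I$ with singular values strictly smaller than 1. The symmetric test matrix, whose eigenvalues 0.9, 0.8, 0.3 and 0.1 are chosen away from the threshold 1/2, and the name `s_lambda_basis` are hypothetical.

```python
import numpy as np

def s_lambda_basis(A):
    """Orthonormal basis (as columns) of the span of the right-singular vectors of
    inv(A) - I whose singular values are < 1; an n x 0 array if there are none."""
    n = A.shape[0]
    _, sing, vt = np.linalg.svd(np.linalg.inv(A) - np.eye(n))
    return vt[sing < 1.0].T      # rows of vt are the right-singular vectors

# Hypothetical symmetric smoother with known eigen-decomposition.
rng = np.random.default_rng(4)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
eigvals = np.array([0.9, 0.8, 0.3, 0.1])
A = Q @ np.diag(eigvals) @ Q.T
basis = s_lambda_basis(A)
```

In this symmetric example the singular values of $A^{-1} - I$ are $|1/a - 1|$ for each eigenvalue $a$ of $A$, so the retained directions are exactly the eigenvectors with eigenvalues 0.9 and 0.8.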

Let us now extend the definition of $S_\lambda$ to singular operators $A_\lambda$. Let us recall that $\mathbb{R}^n = \ker(A_\lambda) \oplus \mathrm{rg}(A_\lambda^*)$ where $A_\lambda^*$ stands for the transpose of $A_\lambda$ and $\mathrm{rg}(A_\lambda^*)$ for its range. The operator $A_\lambda$ then induces a one-to-one operator
between $\mathrm{rg}(A_\lambda^*)$ and $\mathrm{rg}(A_\lambda)$. Write $A_\lambda^+$ for the inverse of this operator from $\mathrm{rg}(A_\lambda)$ to $\mathrm{rg}(A_\lambda^*)$. The orthogonal projection operator from $\mathbb{R}^n$ onto $\mathrm{rg}(A_\lambda^*)$ induces a linear operator from $\mathrm{rg}(A_\lambda)$ into $\mathrm{rg}(A_\lambda^*)$, denoted $\Pi_\lambda$. Then $S_\lambda$ is defined as the linear span of the right-singular vectors of $A_\lambda^+ - \Pi_\lambda$ associated to singular values smaller than 1. Again if this set is empty, $S_\lambda = \{0\}$. When $A_\lambda$ is non-singular or symmetric, we recover the definition of $S_\lambda$ given above.

For each $\lambda \in \Lambda$, take $\mathbb{S}_\lambda$ such that $\mathbb{S}_\lambda \supset \{S_\lambda\}$. From a theoretical point of view, it is enough to take $\mathbb{S}_\lambda = \{S_\lambda\}$ but practically it may be wise to use a larger set and by doing so, to possibly improve the approximation of $\hat f_\lambda$ by elements of $\mathbb{S}_\lambda$. One may for example take $\mathbb{S}_\lambda = \{S_\lambda^1, \dots, S_\lambda^{\lfloor n/2 \rfloor}\}$ where $S_\lambda^k$ is the linear span of the right-singular vectors associated to the $k$ smallest singular values of $A_\lambda^+ - \Pi_\lambda$.



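4.2. Choices of $\mathbb{S}$, $\Delta$ and pen

Take $\mathbb{S} = \bigcup_{\lambda \in \Lambda} \mathbb{S}_\lambda$ and $\Delta$ of the form

$$\Delta(S) = a\,(1 \vee \dim(S)) \quad \text{for all } S \in \mathbb{S}$$

where $a \ge 1$ satisfies Assumption 2 with $\Sigma \le 1$. One may take $a = (\log|\Lambda|) \vee 1$ even though this choice is not necessarily the best. Finally, for some $K > 1$, take $\mathrm{pen}(S) = K\,\mathrm{pen}_\Delta(S)$ for all $S \in \mathbb{S}$ and select $\hat f_{\hat\lambda}$ by minimizing the criterion given by (6), taking thus $\delta = 0$ in (9).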
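
To see why the choice $a = (\log|\Lambda|) \vee 1$ is admissible, here is a short check in the simplest configuration. It assumes, for illustration only, that each $\mathbb{S}_\lambda$ is reduced to a single space, so that $|\mathbb{S}| \le |\Lambda|$. Since $1 \vee \dim(S) \ge 1$ and $a \ge \log|\Lambda|$,

$$\Sigma = \sum_{S \in \mathbb{S}} e^{-a(1 \vee \dim(S))} \le |\mathbb{S}|\, e^{-a} \le |\Lambda|\, e^{-\log|\Lambda|} = 1.$$

When the families $\mathbb{S}_\lambda$ contain several spaces, $|\mathbb{S}|$ may exceed $|\Lambda|$ and this crude bound no longer applies as is.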



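4.3. An oracle-type inequality for linear estimators

The following holds.

Corollary 2. Let $K > 1$, $\kappa \in (0, 1)$ and $\alpha > 0$. If Assumption 1 holds and $\Delta(S) \le \kappa n$ for all $S \in \mathbb{S}$, the estimator $\hat f_{\hat\lambda}$ satisfies

$$C a^{-1}\, \mathbb{E}\big[\|f - \hat f_{\hat\lambda}\|^2\big] \le \inf_{\lambda \in \Lambda} \mathbb{E}\big[\|f - \hat f_\lambda\|^2\big] + \sigma^2$$

for some $C$ depending on $K$, $\alpha$ and $\kappa$ only.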

The problem of selecting some best linear estimator among a family of those has also been considered in Arlot and Bach (2009) in the Gaussian regression framework, and in Goldenshluger and Lepski (2009) in the multidimensional Gaussian white noise model. Arlot and Bach proposed a penalized procedure based on random penalties. Unlike ours, their approach requires that the operators be symmetric with eigenvalues in $[0, 1]$ and that the cardinality of $\Lambda$ be at most polynomial with respect to $n$. Goldenshluger and Lepski proposed a selection rule among families of kernel estimators to solve the problem of structural adaptation. Their approach requires suitable assumptions on the kernels while ours requires none. Nevertheless, we restrict ourselves to the case of the Euclidean loss whereas Goldenshluger and Lepski considered more general $\mathbb{L}_p$ ones.



5. Variable selection 

Throughout this section, we consider the problem of variable selection
introduced in Example 2 and assume that p > 2 in order to avoid trivialities.
When p is small enough (say smaller than 20), this problem can be solved
by using a suitable variable selection procedure that explores all the subsets
of {1, . . . , p}. For example, one may use the penalized criterion introduced
in Birgé and Massart (2001) when the variance is known, and the one in
Baraud et al. (2009) when it is not. When p is larger, such an approach can
no longer be applied since it becomes numerically intractable. To overcome
this problem, algorithms based on the minimization of convex criteria have
been proposed, among which are the Lasso, the Dantzig selector of Candès
and Tao (2007) and the elastic net of Zou and Hastie (2005). An alternative to
those criteria is the forward-backward algorithm described in Zhang (2008),
among others. Since there seems to be no evidence that one of these
procedures outperforms all the others, it may be reasonable to mix them all and
let the data decide which is the most appropriate to solve the problem at
hand. As enlarging F can only improve the risk bound of our estimator, only
the CPU resources should limit the number of candidate estimators.

The procedure we propose can be used not only to select among those
candidate procedures but also to select the tuning parameters they depend
on. From this point of view, it provides an alternative to the cross-validation
techniques, which are quite popular but offer few theoretical guarantees.

Start by choosing a family $\mathcal{L}$ of variable selection procedures. Examples of
such procedures are the Lasso, the Dantzig selector, the elastic net, among
others. If necessary, associate to each $\ell\in\mathcal{L}$ a family of tuning parameters
$H_\ell$. For example, in order to use the Lasso procedure one needs to choose a
tuning parameter $h>0$ among a grid $H_{\mathrm{Lasso}}\subset\mathbb{R}_+$. If a selection procedure
$\ell$ requires no choice of tuning parameters, then one may take $H_\ell=\{0\}$. Let
us denote by $\hat m(\ell,h)$ the subset of {1, . . . , p} corresponding to the predictors
selected by the procedure $\ell$ for the choice of the tuning parameter $h$. For
$m\subset\{1,\ldots,p\}$, let $S_m$ be the linear span of the column vectors $X_{\cdot,j}$ for
$j\in m$ (with the convention $S_\emptyset=\{0\}$). For $\ell\in\mathcal{L}$ and $h\in H_\ell$, associate to
the subset $\hat m(\ell,h)$ an estimator $\hat f_{(\ell,h)}$ of $f$ with values in $S_{\hat m(\ell,h)}$ (one may for
example take the projection of $Y$ onto the random linear space $S_{\hat m(\ell,h)}$, but
any other choice would suit). Finally, consider the family $F=\{\hat f_\lambda,\ \lambda\in\Lambda\}$ of
these estimators by taking $\Lambda=\bigcup_{\ell\in\mathcal{L}}(\{\ell\}\times H_\ell)$ and set $\widehat{\mathcal{M}}=\{\hat m(\lambda),\ \lambda\in\Lambda\}$.
All along we assume that $\Lambda$ is finite (so that we take $\delta=0$ in (9)).
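
The construction of the candidate family can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `project` and `build_family` and the toy selection rule `top_correlation` are hypothetical and do not come from the paper; any procedure returning a subset of column indices could be plugged in instead, and indices are 0-based here.

```python
import numpy as np

def project(Y, X, m):
    """Orthogonal projection of Y onto S_m, the span of the columns of X indexed by m."""
    cols = sorted(m)
    if not cols:
        return np.zeros_like(Y)              # convention: S_emptyset = {0}
    coef, *_ = np.linalg.lstsq(X[:, cols], Y, rcond=None)
    return X[:, cols] @ coef

def build_family(Y, X, procedures, grids):
    """Map each lambda = (procedure name, h) to (selected subset, projection estimator)."""
    family = {}
    for name, select in procedures.items():
        for h in grids.get(name, [0]):       # H = {0} when no tuning parameter is needed
            m = frozenset(select(Y, X, h))
            family[(name, h)] = (m, project(Y, X, m))
    return family

def top_correlation(Y, X, h):
    """Hypothetical toy rule: keep the h columns most correlated with Y."""
    scores = np.abs(X.T @ Y)
    return np.argsort(-scores)[:h].tolist()

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 8))
Y = X[:, :2] @ np.array([3.0, -2.0]) + rng.standard_normal(50)
family = build_family(Y, X, {"topcorr": top_correlation}, {"topcorr": [1, 2, 3]})
```

Each key of `family` plays the role of one index $\lambda=(\ell,h)$, and the stored pair is the random subset together with the associated projection estimator.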



The approximation spaces and the weight function

Throughout, we shall restrict ourselves to subsets of predictors with
cardinality not larger than some $D_{\max}\le n-2$. In view of approximating the
estimators $\hat f_\lambda$, we suggest the collection $\mathbb{S}$ given by

\[ \mathbb{S}=\big\{S_m \,\big|\, m\subset\{1,\ldots,p\},\ \mathrm{card}(m)\le D_{\max}\big\}. \quad (26) \]

We associate to $\mathbb{S}$ the weight function $\Delta$ defined for $S\in\mathbb{S}$ by

\[ \Delta(S)=\log\binom{p}{D}+\log(1+D)\quad\text{with } D=\dim(S). \quad (27) \]

Since

\[ \sum_{S\in\mathbb{S}}e^{-\Delta(S)}=\sum_{D=0}^{D_{\max}}\ \sum_{S\in\mathbb{S},\,\dim(S)=D}e^{-\Delta(S)}\le\sum_{D=0}^{p}e^{-\log(1+D)}\le1+\log(1+p), \]

Assumption 2 is satisfied with $\Sigma=1+\log(1+p)$.
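
As a quick numerical sanity check of the bound above, the following Python sketch evaluates the weight (27) and the resulting sum. It assumes the worst case in which every subset of size D yields a distinct space of dimension D, so that there are exactly $\binom{p}{D}$ spaces of dimension D; the function names are illustrative only.

```python
import math

def weight(p, D):
    """Delta(S) = log C(p, D) + log(1 + D) for a space S of dimension D, as in (27)."""
    return math.log(math.comb(p, D)) + math.log(1 + D)

def weight_sum(p, d_max):
    """Sum of exp(-Delta(S)) when there are C(p, D) spaces of each dimension D <= d_max."""
    return sum(math.comb(p, D) * math.exp(-weight(p, D)) for D in range(d_max + 1))

p, d_max = 30, 10
total = weight_sum(p, d_max)          # equals sum_{D=0}^{d_max} 1 / (1 + D)
bound = 1 + math.log(1 + p)           # the value of Sigma used in the text
```

Under this assumption each dimension D contributes exactly 1/(1 + D), so the sum is a partial harmonic sum, which is why a logarithmic bound appears.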

Let us now turn to the choices of the $\mathbb{S}_\lambda\subset\mathbb{S}$. The criterion given by (6)
cannot be computed when $\mathbb{S}_\lambda=\mathbb{S}$ for all $\lambda$ as soon as p is too large. In
such a case, one must consider a smaller subset of $\mathbb{S}$ and we suggest, for $\lambda=(\ell,h)$,

\[ \mathbb{S}_{(\ell,h)}=\big\{S_{\hat m(\ell,h')},\ h'\in H_\ell\big\} \]

(where the $S_m$ are defined above), or preferably

\[ \mathbb{S}_{(\ell,h)}=\big\{S_{\hat m(\ell',h')},\ \ell'\in\mathcal{L},\ h'\in H_{\ell'}\big\} \]

whenever this latter family is not too large. Note that these two families are
random.



5.2. The results 

Our choices of $\Delta$ and $\mathbb{S}_\lambda$ ensure that $\hat f_\lambda\in S_{\hat m(\lambda)}\in\mathbb{S}_\lambda$ for all $\lambda\in\Lambda$ and that

\[ \Delta(S_{\hat m(\lambda)})\le2\dim(S_{\hat m(\lambda)})\log p. \]

Hence, by applying Corollary 1 with $S_\lambda=S_{\hat m(\lambda)}$, we get the following result.

Corollary 3. Let $K>1$, $\kappa\in(0,1)$ and $D_{\max}$ be some positive integer
satisfying $D_{\max}\le\kappa n/(2\log p)$. Let $\widehat{\mathcal{M}}=\{\hat m(\lambda),\ \lambda\in\Lambda\}$ be a (finite) collection
of random subsets of {1, . . . , p} with cardinality not larger than $D_{\max}$ based
on the observation $Y$ and $\{\hat f_\lambda,\ \lambda\in\Lambda\}$ a family of estimators of $f$, also based on
$Y$, such that $\hat f_\lambda\in S_{\hat m(\lambda)}$. By applying our selection procedure, the resulting
estimator $\hat f_{\hat\lambda}$ satisfies

\[ C\,\mathbb{E}\big[\|f-\hat f_{\hat\lambda}\|^2\big]\le\inf_{\lambda\in\Lambda}\Big\{\mathbb{E}\big[\|f-\hat f_\lambda\|^2\big]+\mathbb{E}\big[\dim(S_{\hat m(\lambda)})\big]\log(p)\,\sigma^2\Big\} \]

where $C$ is a constant depending on the choices of $K$ and $\kappa$ only.

Again, note that the risk bound we get is non-increasing with respect to $\Lambda$.
This means that if one adds a new variable selection procedure or considers
more tuning parameters to increase $\Lambda$, the risk bound we get can only be
improved.

Without additional information on the estimators $\hat f_\lambda$ it is difficult to
compare $\mathbb{E}[\dim(S_{\hat m(\lambda)})]\sigma^2$ and $\mathbb{E}[\|f-\hat f_\lambda\|^2]$. If $\hat f_\lambda$ is of the form $\Pi_SY$ for some
deterministic subset $S\in\mathbb{S}$ it is well-known that

\[ \mathbb{E}\big[\|f-\Pi_SY\|^2\big]=\|f-\Pi_Sf\|^2+\dim(S)\sigma^2\ge\dim(S)\sigma^2. \]

Under the assumption that $f\in S_{m^*}$ and that $m^*$ belongs to $\widehat{\mathcal{M}}$ with
probability close enough to 1, we can compare the risk of the estimator $\hat f_{\hat\lambda}$ to the
cardinality of $m^*$.
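
The bias-variance identity for a projection onto a fixed space can be checked by simulation. The sketch below is a Monte Carlo illustration only; the dimensions, the noise level and the random space $S$ are arbitrary choices made for the example, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, sigma = 40, 5, 1.5

# Orthonormal basis Q of a fixed (deterministic once drawn) d-dimensional space S
Q, _ = np.linalg.qr(rng.standard_normal((n, d)))
P = Q @ Q.T                                    # projector onto S (symmetric)
f = rng.standard_normal(n)

# Right-hand side of the identity: squared bias plus dim(S) * sigma^2
theory = np.sum((f - P @ f) ** 2) + d * sigma ** 2

# Monte Carlo estimate of E ||f - Pi_S Y||^2 over independent draws of Y = f + eps
reps = 20000
Y = f + sigma * rng.standard_normal((reps, n))
losses = np.sum((f - Y @ P) ** 2, axis=1)      # each row of Y @ P is Pi_S applied to one draw
mc = losses.mean()
```

With these settings `mc` should agree with `theory` up to Monte Carlo error, and `theory` is never smaller than `d * sigma ** 2`.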

Corollary 4. Assume that the assumptions of Corollary 3 hold and that $\hat f_\lambda=\Pi_{S_{\hat m(\lambda)}}Y$
for all $\lambda\in\Lambda$. If $f\in S_{m^*}$ for some non-void subset $m^*\subset\{1,\ldots,p\}$
with cardinality not larger than $D_{\max}$, then

\[ C\,\mathbb{E}\big[\|f-\hat f_{\hat\lambda}\|^2\big]\le\log(p)\,|m^*|\,\sigma^2+R_n(m^*) \]

where $C$ is a constant depending on $K$ and $\kappa$ only, and

\[ R_n(m^*)=\big(\|f\|^2+n\sigma^2\big)\,\mathbb{P}\big[m^*\notin\widehat{\mathcal{M}}\big]^{1/2}. \]

Zhao and Yu (2006) give sufficient conditions on the design $X$ to ensure
that $\mathbb{P}[m^*\notin\widehat{\mathcal{M}}]$ is exponentially small with respect to n when the family
$\widehat{\mathcal{M}}$ is obtained by using the LARS-Lasso algorithm with different values of
the tuning parameter.

6. Simulation study 

In the linear regression setting described in Example 2, we carry out a simu- 
lation study to evaluate the performances of our procedure to solve the two 
following problems. 

We first consider the problem, described in Example 3, of tuning the
smoothing parameter of the Lasso procedure for estimating $f$. The performances of
our procedure are compared with those of the V-fold cross-validation method.
Secondly, we consider the problem of variable selection. We solve it by using
our criterion in view of selecting among a family $\mathcal{L}$ of candidate variable
selection procedures.

Our simulation study is based on a large number of examples which have
been chosen in view of covering a large variety of situations. Most of these
have been found in the literature in the context of Example 2 either for
estimation or variable selection purposes when the number p of predictors is
large.

The section is organized as follows. The simulation design is given in the 
following section. Then, we describe how our procedure is applied for tuning 
the Lasso and performing variable selection. Finally, we give the results of 
the simulation study. 

6.1. Simulation design 

One example is determined by the number of observations n, the number of
variables p, the $n\times p$ matrix $X$, the values of the parameters $\beta$, and the ratio
signal/noise $\rho$. It is denoted by $ex(n,p,X,\beta,\rho)$, and the set of all considered
examples is denoted $\mathcal{E}$. For each example, we carry out 400 simulations of $Y$
as a Gaussian random vector with expectation $f=X\beta$ and variance $\sigma^2I_n$,
where $I_n$ is the $n\times n$ identity matrix, and $\sigma^2=\|f\|^2/(n\rho)$.
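
For concreteness, the simulation of the responses for one example can be sketched as follows. The function name `simulate_responses` and the toy design are hypothetical; only the rule $\sigma^2=\|f\|^2/(n\rho)$ and the 400 replications come from the text.

```python
import numpy as np

def simulate_responses(X, beta, rho, n_sim=400, seed=0):
    """Draw n_sim Gaussian vectors Y = f + eps with f = X beta and sigma^2 = ||f||^2 / (n rho)."""
    rng = np.random.default_rng(seed)
    f = X @ beta
    n = X.shape[0]
    sigma2 = np.sum(f ** 2) / (n * rho)
    Y = f + np.sqrt(sigma2) * rng.standard_normal((n_sim, n))
    return f, sigma2, Y

# Toy design, for illustration only
rng = np.random.default_rng(42)
X = rng.standard_normal((100, 50))
beta = np.zeros(50)
beta[:5] = 2.5
f, sigma2, Y = simulate_responses(X, beta, rho=10)
```

Each row of `Y` is one simulated observation vector; the signal-to-noise ratio $\|f\|^2/(n\sigma^2)$ equals `rho` by construction.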

The collection $\mathcal{E}$ is composed of several collections $\mathcal{E}_e$ for $e=1,\ldots,E$ where
each collection $\mathcal{E}_e$ is characterized by a vector of parameters $\beta_e$, and a set $\mathcal{X}_e$
of matrices $X$:

\[ \mathcal{E}_e=\big\{ex(n,p,X,\beta,\rho):\ (n,p)\in\mathcal{I},\ X\in\mathcal{X}_e,\ \beta=\beta_e,\ \rho\in\mathcal{R}\big\} \]

where $\mathcal{R}=\{5,10,20\}$ and $\mathcal{I}$ consists of pairs $(n,p)$ such that p is smaller than,
equal to or greater than n. The examples are described in further detail in
Section 8.2. They are inspired by examples found in Tibshirani (1996), Zou
and Hastie (2005), Zou (2006), and Huang et al. (2008) for comparing the
Lasso method to the ridge, adaptive Lasso and elastic net methods. They
make up a large variety of situations. They include cases where

• the covariates are not, moderately or strongly correlated, 

• the covariates with zero coefficients are weakly or highly correlated with 
covariates with non-zero coefficients, 

• the covariates with non-zero coefficients are grouped and correlated 
within these groups, 

• the lasso method is known to be inconsistent, 

• few or many effects are present. 

6.2. Tuning a smoothing parameter

We consider here the problem of tuning the smoothing parameter of the
Lasso estimator as described in Example 3. Instead of considering the Lasso
estimators for a fixed grid $\Lambda$ of smoothing parameters $\lambda$, we rather focus
on the sequence $\{\hat f_1,\ldots,\hat f_{D_{\max}}\}$ of estimators given by the $D_{\max}$ first steps
of the LARS-Lasso algorithm proposed by Efron et al. (2004). Hence, the
tuning parameter is here the number $h\in H=\{1,\ldots,D_{\max}\}$ of steps. In our
simulation study, we compare the performance of our criterion to that of the
V-fold cross-validation for the problem of selecting the best estimator among
the collection $F=\{\hat f_1,\ldots,\hat f_{D_{\max}}\}$.

6.2.1. The estimator of f based on our procedure 

We recall that our selection procedure relies on the choices of families $\mathbb{S}$, $\mathbb{S}_h$
for $h\in H$, a weight function $\Delta$, a penalty function pen and two universal
constants $K>1$ and $\alpha>0$. We choose the family $\mathbb{S}$ defined by (26). We
associate to $\hat f_h$ the family $\mathbb{S}_h=\{S_{\hat m(h')}\,|\,h'\in H\}\subset\mathbb{S}$ where the $S_m$ are defined
in Section 5.1 and $\hat m(h')\subset\{1,\ldots,p\}$ is the set of indices corresponding to
the predictors returned by the LARS-Lasso algorithm at step $h'\in H$. We
take $\mathrm{pen}(S)=K\,\mathrm{pen}_\Delta(S)$ with $\Delta(S)$ defined by (27) and $K=1.1$. This
value of $K$ is consistent with what is suggested in Baraud et al. (2009). The
choice of $\alpha$ is based on the following considerations. First, choosing $\alpha$ around
one seems reasonable since it weights similarly the term $\|Y-\Pi_S\hat f_\lambda\|^2$, which
measures how well the estimator fits the data, and the approximation term
$\|\hat f_\lambda-\Pi_S\hat f_\lambda\|^2$ involved in our criterion (6). Second, simple calculation shows
that the constant $C^{-1}=C^{-1}(1.1,\alpha)$ involved in Theorem 1 is minimum for
$\alpha$ close to 0.6. We therefore carried out our simulations for $\alpha$ varying from
0.2 to 1.5. The results being very similar for $\alpha$ between 0.5 and 1.2, we choose
$\alpha=0.5$. We denote by $\hat f_{\mathrm{pen}_\Delta}$ the resulting estimator of $f$.
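
The claim about the minimizing value of $\alpha$ can be checked numerically. The sketch below assumes that the constant of Theorem 1 has the form $C^{-1}(K,\alpha)=(1+\alpha-K^{-1})(\alpha+2(1+K^{-1}))/(\alpha(1-K^{-1}))$, which is our reading of equation (33) in the proofs; if the constant in the original differs, the numbers below change accordingly.

```python
import numpy as np

def inv_constant(K, alpha):
    """Assumed form of C^{-1}(K, alpha): (alpha + a)(alpha + b) / (alpha * a)."""
    a = 1.0 - 1.0 / K                  # a = 1 - K^{-1}
    b = 2.0 * (1.0 + 1.0 / K)          # b = 2 (1 + K^{-1})
    return (alpha + a) * (alpha + b) / (alpha * a)

K = 1.1
grid = np.linspace(0.2, 1.5, 1301)     # the range of alpha explored in the simulations
values = inv_constant(K, grid)
alpha_grid = grid[np.argmin(values)]

# Expanding gives (alpha + a + b + a*b/alpha) / a, which is minimal at alpha = sqrt(a*b)
alpha_closed = np.sqrt((1.0 - 1.0 / K) * 2.0 * (1.0 + 1.0 / K))
```

For $K=1.1$ both the grid search and the closed form give a minimizer slightly below 0.6, in line with the statement in the text.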

6.2.2. The estimator of f based on V-fold cross-validation 

For each $h\in H$, the prediction error is estimated using a V-fold
cross-validation procedure, with $V=n/10$. The estimator $\hat f_{CV}$ is chosen by
minimizing the estimated prediction error.

Table 1. Mean, standard error and quantiles of the ratios $R_{ex}/O_{ex}$ for the
procedures CV and $\mathrm{pen}_\Delta$, calculated over all $ex\in\mathcal{E}$ such
that $O_{ex}<n\sigma^2/3$. The number of such examples equals 654, see Section 8.2.
(The numerical entries of the table are missing from this copy.)



6.2.3. The results 

The simulations were carried out with R (www.r-project.org) using the
library elasticnet.

For each example $ex\in\mathcal{E}$, we estimate on the basis of 400 simulations the
oracle risk

\[ O_{ex}=\mathbb{E}\Big[\min_{h\in H}\|f-\hat f_h\|^2\Big] \quad (28) \]

and the Euclidean risks $R_{ex}(\hat f_{\mathrm{pen}_\Delta})$ and $R_{ex}(\hat f_{CV})$ of $\hat f_{\mathrm{pen}_\Delta}$ and $\hat f_{CV}$
respectively.

The results presented in Table 1 show that our procedure tends to choose
a better estimator than the CV in the sense that the ratios $R_{ex}(\hat f_{\mathrm{pen}_\Delta})/O_{ex}$
are closer to one than $R_{ex}(\hat f_{CV})/O_{ex}$.

Nevertheless, for a few examples these ratios are larger for our procedure 
than for the CV. These examples correspond to situations where the Lasso 
estimators are highly biased. 

In practice, it is worth considering several estimation procedures in order to 
increase the chance to have good estimators of / among the family F. Select- 
ing among candidate procedures is the purpose of the following simulation 
experiment in the variable selection context. 

6.3. Variable selection 

In this section, we consider the problem of variable selection and use the
procedure and notations introduced in Section 5. To solve this problem, we
consider estimators of the form $\hat f_{\hat m}=\Pi_{S_{\hat m}}Y$ where $\hat m$ is a random subset
of {1, . . . , p} depending on $Y$. Given a family $\widehat{\mathcal{M}}=\{\hat m(\ell,h),\ (\ell,h)\in\mathcal{L}\times H_\ell\}$
of such random sets, we consider the family $F=\{\hat f_{\hat m(\ell,h)}\,|\,(\ell,h)\in\mathcal{L}\times H_\ell\}$.
The descriptions of $\mathcal{L}$ and $H_\ell$ are postponed to Section 8.3. Let us
merely mention that we choose $\mathcal{L}$ which gathers variable selection procedures
based on the Lasso, ridge regression, Elastic net, PLS1 regression, Adaptive
Lasso, Random Forest, and on an exhaustive search among the subsets of
{1, . . . , p} with small cardinality. For each procedure $\ell$, the parameter set $H_\ell$
corresponds to different choices of tuning parameters. For each $\lambda=(\ell,h)\in\mathcal{L}\times H_\ell$,
we take $\mathbb{S}_\lambda=\{S_{\hat m(\ell,h)}\}$ so that our selection rule over $F$ amounts to
minimizing over $\widehat{\mathcal{M}}$

\[ \mathrm{crit}(m)=\|Y-\Pi_{S_m}Y\|^2+K\,\mathrm{pen}_\Delta(S_m)\,\hat\sigma^2_{S_m}, \quad (29) \]

where $\mathrm{pen}_\Delta$ is given by (3).

6.3.1. Results 

The simulations were carried out with R (www.r-project.org) using the
libraries elasticnet, randomForest, pls and the program lm.ridge in the
library MASS. We first select the tuning parameters associated to the
procedures $\ell$ in $\mathcal{L}$. More precisely, for each $\ell$ we select an estimator among
the collection $F_\ell=\{\hat f_{\hat m(\ell,h)}\,|\,h\in H_\ell\}$ by minimizing Criterion (29) over
$\widehat{\mathcal{M}}_\ell=\{\hat m(\ell,h)\,|\,h\in H_\ell\}$. We denote by $\hat m(\ell)$ the selected set and by $\hat f_{\hat m(\ell)}$
the corresponding projection estimator. For each example $ex\in\mathcal{E}$ and each
method $\ell\in\mathcal{L}$, we estimate the risk

\[ R_{ex,\ell}=\mathbb{E}\big[\|f-\hat f_{\hat m(\ell)}\|^2\big] \]

of $\hat f_{\hat m(\ell)}$ on the basis of 400 simulations and we do the same to calculate that
of our estimator $\hat f_{\hat\lambda}$,

\[ R_{ex,\mathrm{all}}=\mathbb{E}\big[\|f-\hat f_{\hat\lambda}\|^2\big]. \]

Let us now define the minimum of these risks over all methods:

\[ R_{ex,\min}=\min\{R_{ex,\mathrm{all}},\ R_{ex,\ell},\ \ell\in\mathcal{L}\}. \]

We compare the ratios $R_{ex,\ell}/R_{ex,\min}$ for $\ell\in\mathcal{L}\cup\{\mathrm{all}\}$ to judge the
performances of the candidate procedures on each example $ex\in\mathcal{E}$. The mean,
standard deviations and quantiles of the sequence $\{R_{ex,\ell}/R_{ex,\min},\ ex\in\mathcal{E}\}$
are presented in Table 2. In particular, the results show that

procedure     mean    std-err   quantiles (in increasing order)
exhaustive    22.9    45        6.30    24.5    92.9    430
all           1.16    0.16      1.12    1.25    1.47    1.95

Table 2. For each $\ell\in\mathcal{L}\cup\{\mathrm{all}\}$, mean, standard error and quantiles of the ratios $R_{ex,\ell}/R_{ex,\min}$
calculated over all $ex\in\mathcal{E}$. The number of examples in the collection $\mathcal{E}$ is equal to 660.
(Only the rows "exhaustive" and "all" survive in this copy; the isolated values 4.13, 118 and 1.42, the last next to the label rFpurity, cannot be assigned to a column.)



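collection          FDR      TDR
$\mathcal{E}_1$     0.045    0.74
$\mathcal{E}_2$     0.026    0.63
$\mathcal{E}_3$     0.004    0.18
$\mathcal{E}_4$     0.026    0.63
$\mathcal{E}_5$     0.018    0.17
$\mathcal{E}_6$     0.041    0.99
$\mathcal{E}_7$     0.012    1
$\mathcal{E}_8$     0.026    1
$\mathcal{E}_9$     0.042    0.98
$\mathcal{E}_{10}$  0.15     0.29
$\mathcal{E}_{11}$  0.014    0.20

Table 3. False discovery rate (FDR) and true discovery rate (TDR) using our method, for each
example with $\rho=10$ and $n=p=100$.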



• none of the procedures $\ell$ in $\mathcal{L}$ outperforms all the others simultaneously
over all examples,

• our procedure, corresponding to $\ell=\mathrm{all}$, achieves the smallest mean
value. Besides, this value is very close to one,

• the variability of our procedure is small compared to the others,

• for all examples, our procedure selects an estimator the risk of which
does not exceed twice that of the oracle.

The false discovery rate (FDR) and the true discovery rate (TDR) are also
parameters of interest in the context of variable selection. These quantities
are given in Table 3 for each example when $\rho=10$ and $n=p=100$. Except
for one example, the FDR is small, while the TDR varies a lot among
the examples.
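
The text does not spell out how the two rates are computed. A minimal sketch, assuming the usual conventions (FDR as the proportion of selected predictors with a zero coefficient, TDR as the proportion of predictors with a non-zero coefficient that are selected), could read as follows; the function name is illustrative, and in a simulation study these per-run values would presumably be averaged over the replications.

```python
def discovery_rates(selected, true_support):
    """Return (FDR, TDR) for one selected subset, under the assumed definitions above."""
    selected, true_support = set(selected), set(true_support)
    true_pos = len(selected & true_support)
    fdr = (len(selected) - true_pos) / len(selected) if selected else 0.0
    tdr = true_pos / len(true_support) if true_support else 0.0
    return fdr, tdr

# Toy run: 15 truly active predictors, 4 selected, one of them a false discovery
fdr, tdr = discovery_rates({1, 2, 3, 20}, set(range(1, 16)))
```

In this toy run one of the four selected predictors is inactive and three of the fifteen active ones are recovered.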
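
7. Proofs

7.1. Proof of Theorem 1

Throughout this section, we use the following notations. For all $\lambda\in\Lambda$ and
$S\in\mathbb{S}_\lambda$, we write

\[ \mathrm{crit}_\alpha(\hat f_\lambda,S)=\|Y-\Pi_S\hat f_\lambda\|^2+\sigma^2\,\widetilde{\mathrm{pen}}(S)+\alpha\|\hat f_\lambda-\Pi_S\hat f_\lambda\|^2 \quad (30) \]

where

\[ \widetilde{\mathrm{pen}}(S)=\mathrm{pen}(S)\,\hat\sigma_S^2/\sigma^2,\quad\text{for all } S\in\mathbb{S}. \]

For all $\lambda\in\Lambda$, let $\hat S(\lambda)\in\mathbb{S}_\lambda$ be such that

\[ \mathrm{crit}_\alpha(\hat f_\lambda,\hat S(\lambda))\le\mathrm{crit}_\alpha(\hat f_\lambda)+\delta. \]

We also write $\varepsilon=Y-f$ and $\bar S$ for the linear space generated by $S$ and $f$. It
follows from the fact that for all $\lambda\in\Lambda$ and $S\in\mathbb{S}_\lambda$

\[ \mathrm{crit}_\alpha(\hat f_{\hat\lambda},\hat S(\hat\lambda))\le\mathrm{crit}_\alpha(\hat f_{\hat\lambda})+\delta\le\mathrm{crit}_\alpha(\hat f_\lambda)+2\delta\le\mathrm{crit}_\alpha(\hat f_\lambda,S)+2\delta \]

and simple algebra that

\[
\begin{aligned}
\|f-\Pi_{\hat S(\hat\lambda)}\hat f_{\hat\lambda}\|^2+\alpha\|\hat f_{\hat\lambda}-\Pi_{\hat S(\hat\lambda)}\hat f_{\hat\lambda}\|^2
\le{}&\|f-\Pi_S\hat f_\lambda\|^2+\alpha\|\hat f_\lambda-\Pi_S\hat f_\lambda\|^2+2\sigma^2\,\widetilde{\mathrm{pen}}(S)+2\delta\\
&+2\langle\varepsilon,\Pi_{\hat S(\hat\lambda)}\hat f_{\hat\lambda}-f\rangle-\sigma^2\,\widetilde{\mathrm{pen}}(\hat S(\hat\lambda))\\
&+2\langle\varepsilon,f-\Pi_S\hat f_\lambda\rangle-\sigma^2\,\widetilde{\mathrm{pen}}(S).
\end{aligned}
\]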
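
For $\lambda\in\Lambda$ and $S\in\mathbb{S}$, let us set $u_{\lambda,S}=(\Pi_S\hat f_\lambda-f)/\|\Pi_S\hat f_\lambda-f\|$ if $\Pi_S\hat f_\lambda\ne f$
and $u_{\lambda,S}=0$ otherwise. For all $\lambda$ and $S$, we have $u_{\lambda,S}\in\bar S$ and

\[
\begin{aligned}
\|f-\Pi_{\hat S(\hat\lambda)}\hat f_{\hat\lambda}\|^2+\alpha\|\hat f_{\hat\lambda}-\Pi_{\hat S(\hat\lambda)}\hat f_{\hat\lambda}\|^2
\le{}&\|f-\Pi_S\hat f_\lambda\|^2+\alpha\|\hat f_\lambda-\Pi_S\hat f_\lambda\|^2+2\sigma^2\,\widetilde{\mathrm{pen}}(S)+2\delta\\
&+2\,|\langle\varepsilon,u_{\hat\lambda,\hat S(\hat\lambda)}\rangle|\,\|\Pi_{\hat S(\hat\lambda)}\hat f_{\hat\lambda}-f\|-\sigma^2\,\widetilde{\mathrm{pen}}(\hat S(\hat\lambda))\\
&+2\,|\langle\varepsilon,u_{\lambda,S}\rangle|\,\|\Pi_S\hat f_\lambda-f\|-\sigma^2\,\widetilde{\mathrm{pen}}(S)\\
\le{}&\|f-\Pi_S\hat f_\lambda\|^2+\alpha\|\hat f_\lambda-\Pi_S\hat f_\lambda\|^2+2\sigma^2\,\widetilde{\mathrm{pen}}(S)+2\delta\\
&+K^{-1}\|f-\Pi_{\hat S(\hat\lambda)}\hat f_{\hat\lambda}\|^2+K\,\|\Pi_{\overline{\hat S(\hat\lambda)}}\varepsilon\|^2-\sigma^2\,\widetilde{\mathrm{pen}}(\hat S(\hat\lambda))\\
&+K^{-1}\|f-\Pi_S\hat f_\lambda\|^2+K\,\|\Pi_{\bar S}\varepsilon\|^2-\sigma^2\,\widetilde{\mathrm{pen}}(S).
\end{aligned}
\]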



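Hence, by using (5) and (30) we get

\[
\begin{aligned}
(1-K^{-1})\|f-\Pi_{\hat S(\hat\lambda)}\hat f_{\hat\lambda}\|^2+\alpha\|\hat f_{\hat\lambda}-\Pi_{\hat S(\hat\lambda)}\hat f_{\hat\lambda}\|^2
&\le(1+K^{-1})\|f-\Pi_S\hat f_\lambda\|^2+\alpha\|\hat f_\lambda-\Pi_S\hat f_\lambda\|^2+2\hat\sigma_S^2\,\mathrm{pen}(S)+t+2\delta\\
&\le2(1+K^{-1})\|f-\hat f_\lambda\|^2+\big(\alpha+2(1+K^{-1})\big)\|\hat f_\lambda-\Pi_S\hat f_\lambda\|^2+2\hat\sigma_S^2\,\mathrm{pen}(S)+t+2\delta \quad (31)
\end{aligned}
\]

where

\[ t=2K\sum_{S\in\mathbb{S}}\Big(\|\Pi_{\bar S}\varepsilon\|^2-\frac{\mathrm{pen}_\Delta(S)}{n-\dim(S)}\,\|Y-\Pi_SY\|^2\Big)_+. \]

For each $S\in\mathbb{S}$,

\[ \frac{\|Y-\Pi_SY\|^2}{n-\dim(S)}\ge\frac{\|Y-\Pi_{\bar S}Y\|^2}{n-\dim(S)} \]

and since the variable $\|Y-\Pi_{\bar S}Y\|^2$ is independent of $\|\Pi_{\bar S}\varepsilon\|^2$ and is
stochastically larger than $\|\varepsilon-\Pi_{S'}\varepsilon\|^2$, where $S'$ denotes a linear space of
dimension $\dim(S)+1$ containing $\bar S$, we deduce from the definition of $\mathrm{pen}_\Delta(S)$
and (2), that on the one hand $\mathbb{E}(t)\le2K\sigma^2\Sigma$.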
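
On the other hand, since $S$ is arbitrary among $\mathbb{S}_\lambda$ and since

\[ \frac{\alpha(1-K^{-1})}{\alpha+1-K^{-1}}\,\|f-\hat f_{\hat\lambda}\|^2\le(1-K^{-1})\|f-\Pi_{\hat S(\hat\lambda)}\hat f_{\hat\lambda}\|^2+\alpha\|\hat f_{\hat\lambda}-\Pi_{\hat S(\hat\lambda)}\hat f_{\hat\lambda}\|^2, \]

we deduce from (31) that for all $\lambda\in\Lambda$,

\[ \|f-\hat f_{\hat\lambda}\|^2\le C^{-1}\Big[\|f-\hat f_\lambda\|^2+A(\hat f_\lambda,\mathbb{S}_\lambda)+t+\delta\Big] \quad (32) \]

with

\[ C^{-1}=C^{-1}(K,\alpha)=\frac{(1+\alpha-K^{-1})\,(\alpha+2(1+K^{-1}))}{\alpha(1-K^{-1})} \quad (33) \]

and (11) follows by taking the expectation on both sides of (32). Note that
provided that

\[ \inf_{\lambda\in\Lambda}\Big\{\|f-\hat f_\lambda\|^2+A(\hat f_\lambda,\mathbb{S}_\lambda)\Big\} \]

is measurable, we have actually proved the stronger inequality

\[ C\,\mathbb{E}\big[\|f-\hat f_{\hat\lambda}\|^2\big]\le\mathbb{E}\Big[\inf_{\lambda\in\Lambda}\Big\{\|f-\hat f_\lambda\|^2+A(\hat f_\lambda,\mathbb{S}_\lambda)\Big\}\Big]+2K\sigma^2\Sigma+\delta. \quad (34) \]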



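Let us now turn to the second part of the Theorem, fixing some $\lambda\in\Lambda$.
Since equality holds in (5), under Assumption 3 by (4)

\[ \mathrm{pen}(S)=K\,\mathrm{pen}_\Delta(S)\le C(\kappa,K)\,(\dim(S)\vee\Delta(S)),\quad\forall S\in\mathbb{S}. \]

If $\mathbb{S}_\lambda$ is non-random, for some $C'=C'(\kappa,K)>0$ and all $S\in\mathbb{S}_\lambda$,

\[
\begin{aligned}
C'\,\mathbb{E}\big[A(\hat f_\lambda,\mathbb{S}_\lambda)\big]
&\le\mathbb{E}\big[\|\hat f_\lambda-\Pi_S\hat f_\lambda\|^2\big]+(\dim(S)\vee\Delta(S))\,\mathbb{E}\big[\hat\sigma_S^2\big]\\
&=\mathbb{E}\big[\|\hat f_\lambda-\Pi_S\hat f_\lambda\|^2\big]+\frac{\dim(S)\vee\Delta(S)}{n-\dim(S)}\Big[\|f-\Pi_Sf\|^2+(n-\dim(S))\sigma^2\Big].
\end{aligned}
\]

Since $\|f-\Pi_Sf\|^2\le\|f-\Pi_S\hat f_\lambda\|^2$, we have

\[ \|f-\Pi_Sf\|^2\le\mathbb{E}\big[\|f-\Pi_S\hat f_\lambda\|^2\big]\le2\,\mathbb{E}\big[\|f-\hat f_\lambda\|^2\big]+2\,\mathbb{E}\big[\|\hat f_\lambda-\Pi_S\hat f_\lambda\|^2\big] \]

and under Assumption 3, $(\dim(S)\vee\Delta(S))/(n-\dim(S))\le\kappa(1-\kappa)^{-1}$, and
hence for all $S\in\mathbb{S}_\lambda$

\[ C'\,\mathbb{E}\big[A(\hat f_\lambda,\mathbb{S}_\lambda)\big]\le\frac{2\kappa}{1-\kappa}\,\mathbb{E}\big[\|f-\hat f_\lambda\|^2\big]+\Big(1+\frac{2\kappa}{1-\kappa}\Big)\mathbb{E}\big[\|\hat f_\lambda-\Pi_S\hat f_\lambda\|^2\big]+(\dim(S)\vee\Delta(S))\,\sigma^2 \]

which leads to (12).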

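Let us turn to the proof of (13). We set $\hat\sigma^2_\lambda=\hat\sigma^2_{\hat S_\lambda}$. Since with probability
one $\hat f_\lambda\in\hat S_\lambda\in\mathbb{S}_\lambda$,

\[ \mathbb{E}\big[A(\hat f_\lambda,\mathbb{S}_\lambda)\big]\le\mathbb{E}\big[\mathrm{pen}(\hat S_\lambda)\,\hat\sigma^2_\lambda\big] \]

and it suffices thus to bound the right-hand side. Since equality holds in (5)
and since $\hat f_\lambda\in\hat S_\lambda$,

\[
\begin{aligned}
\mathrm{pen}(\hat S_\lambda)\,\hat\sigma^2_\lambda
&=K\,\frac{\mathrm{pen}_\Delta(\hat S_\lambda)}{n-\dim(\hat S_\lambda)}\,\|Y-\Pi_{\hat S_\lambda}Y\|^2\\
&\le K\,\frac{\mathrm{pen}_\Delta(\hat S_\lambda)}{n-\dim(\hat S_\lambda)}\,\|f+\varepsilon-\hat f_\lambda\|^2\\
&\le2K\,\frac{\mathrm{pen}_\Delta(\hat S_\lambda)}{n-\dim(\hat S_\lambda)}\Big[\|f-\hat f_\lambda\|^2+\|\varepsilon\|^2\Big]\\
&\le2K\,\frac{\mathrm{pen}_\Delta(\hat S_\lambda)}{n-\dim(\hat S_\lambda)}\Big[\|f-\hat f_\lambda\|^2+\big(\|\varepsilon\|^2-2n\sigma^2\big)_++2n\sigma^2\Big].
\end{aligned}
\]

Under Assumption 3, $1\le\Delta(\hat S_\lambda)\vee\dim(\hat S_\lambda)\le\kappa n$ and we deduce from (4)
that for some constant $C$ depending only on $K$ and $\kappa$,

\[ C\,\mathrm{pen}(\hat S_\lambda)\,\hat\sigma^2_\lambda\le\|f-\hat f_\lambda\|^2+\big(\dim(\hat S_\lambda)\vee\Delta(\hat S_\lambda)\big)\,\sigma^2\Big[\frac{(\|\varepsilon\|^2-2n\sigma^2)_+}{\sigma^2}+1\Big] \]

and the result follows from the fact that $\mathbb{E}\big[(\|\varepsilon\|^2-2n\sigma^2)_+\big]\le3\sigma^2$ for all n.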



7.2. Proof of Proposition 1 



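For all $\lambda\in\Lambda$ and $f\in S$,

\[ \|f-\hat f_\lambda\|^2\ge\|\Pi_S\hat f_\lambda-\hat f_\lambda\|^2 \]

and hence,

\[ \|f-\hat f_{\hat\lambda}\|^2\ge\inf_{\lambda\in\Lambda}\|f-\hat f_\lambda\|^2\ge\frac12\inf_{\lambda\in\Lambda}\Big[\|f-\hat f_\lambda\|^2+\|\Pi_S\hat f_\lambda-\hat f_\lambda\|^2\Big]. \]

Besides, since the minimax rate of estimation over $S$ is of order $\dim(S)\sigma^2$,
for some universal constant $C$,

\[ C\sup_{f\in S}\mathbb{E}\big[\|f-\hat f_{\hat\lambda}\|^2\big]\ge\dim(S)\,\sigma^2. \]

Putting these bounds together leads to the result.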

7.3. Proof of Proposition 2

Under (24), it is not difficult to see that $d(n,M)=n/(2\log(eM))\ge2$ so
that $\mathbb{S}_{Cv}$ is not empty and since for all $S_m\in\mathbb{S}_{Cv}$

\[ (\dim(S_m)\vee1)\le\Delta(S_m)=|m|+\log\binom{M}{|m|}\le|m|(1+\log M)\le\frac n2, \]

Assumptions 1 to 4 are satisfied with $\kappa=1/2$. Besides, the set $\Lambda_{Cv}$ being
compact, $\lambda\mapsto\mathrm{crit}_\alpha(\hat f_\lambda)$ admits a minimum over $\Lambda_{Cv}$ (we shall come back to the
minimization of this criterion at the end of the subsection) and hence we can
take $\delta=0$. By applying Theorem 1 and using (12), the resulting estimator
$\hat f_{Cv}=\hat f_{\hat\lambda}$ satisfies for some universal constant $C>0$

\[ C\,\mathbb{E}\big[\|f-\hat f_{Cv}\|^2\big]\le\inf_{g\in F_\Lambda}\big\{\|f-g\|^2+A(g,\mathbb{S}_{Cv})\big\} \quad (35) \]

where

\[ A(g,\mathbb{S}_{Cv})=\inf_{S\in\mathbb{S}_{Cv}}\Big[\|g-\Pi_Sg\|^2+(\dim(S)\vee\Delta(S))\,\sigma^2\Big]. \quad (36) \]

We bound $A(g,\mathbb{S}_{Cv})$ from above by using the following approximation result,
the proof of which can be found in Makovoz (1996) (more precisely,
we refer to the proof of his Theorem 2).

Lemma 1. For all g in the convex hull Fa of the (pj and all D > 1, there 
exists m C {1, . . . , M} such that \m\ = {2D) A M and 

\\g-Us„M'<'^D~' sup U.f. 

j=l,...,Af 

By using this lemma and the fact that $\log\binom{M}{D}\le D\log(eM/D)$ for all
$D\in\{1,\ldots,M\}$, we get



A(o,§)< inf 

l<D<d{n,Af)/2 



D 



2D{l + \og{eM/{2D)) 



a\ 



Taking for D the integer part of 

x(n,M,L) 



nL"^ 



log(eM/VnL2) 
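which belongs to $[1,d(n,M)/2]$ under (24), we get

\[ A(g,\mathbb{S}_{Cv})\le C'\sqrt{nL^2\log\big(eM/\sqrt{nL^2}\big)}\;\sigma^2 \quad (37) \]

for some universal constant $C'>0$ which together with (35) leads to the risk
bound

\[ \mathbb{E}\big[\|f-\hat f_{Cv}\|^2\big]-C\inf_{g\in F_\Lambda}\|f-g\|^2\le C\sqrt{nL^2\log\big(eM/\sqrt{nL^2}\big)}\;\sigma^2. \]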



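Concerning the computation of $\hat f_{Cv}$, note that

\[
\begin{aligned}
\inf_{\lambda\in\Lambda_{Cv}}\mathrm{crit}_\alpha(\hat f_\lambda)
&=\inf_{\lambda\in\Lambda_{Cv}}\ \inf_{S\in\mathbb{S}_{Cv}}\Big[\|Y-\Pi_S\hat f_\lambda\|^2+\alpha\|\hat f_\lambda-\Pi_S\hat f_\lambda\|^2+\mathrm{pen}(S)\,\hat\sigma_S^2\Big]\\
&=\inf_{S\in\mathbb{S}_{Cv}}\Big[\inf_{\lambda\in\Lambda_{Cv}}\Big(\|Y-\Pi_S\hat f_\lambda\|^2+\alpha\|\hat f_\lambda-\Pi_S\hat f_\lambda\|^2\Big)+\mathrm{pen}(S)\,\hat\sigma_S^2\Big]
\end{aligned}
\]

and hence, one can solve the problem of minimizing $\mathrm{crit}_\alpha(\hat f_\lambda)$ over $\lambda\in\Lambda_{Cv}$ by
proceeding in two steps. First, for each $S$ in the finite set $\mathbb{S}_{Cv}$ minimize the
convex criterion

\[ \mathrm{crit}_\alpha(S,\hat f_\lambda)=\|Y-\Pi_S\hat f_\lambda\|^2+\alpha\|\hat f_\lambda-\Pi_S\hat f_\lambda\|^2 \]

over the convex (and compact) set $\Lambda_{Cv}$. Denote by $\hat f_{Cv,S}$ the resulting
minimizers. Then, minimize the quantity $\mathrm{crit}_\alpha(S,\hat f_{Cv,S})+\mathrm{pen}(S)\,\hat\sigma_S^2$ for $S$ varying
among $\mathbb{S}_{Cv}$. Denoting by $\hat S$ such a minimizer, we have that $\hat f_{Cv}=\hat f_{Cv,\hat S}$.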
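
7.4. Proof of Proposition 3

By applying Theorem 1, we obtain that the selected estimator $\hat f_{\hat\lambda}$ satisfies

\[ C\,\mathbb{E}\big[\|f-\hat f_{\hat\lambda}\|^2\big]\le\inf_{\lambda\in\{L,MS,Cv\}}\Big\{\mathbb{E}\big[\|f-\hat f_\lambda\|^2\big]+\mathbb{E}\big[A(\hat f_\lambda,\mathbb{S}_\lambda)\big]\Big\}. \]

Let us now bound $\mathbb{E}[A(\hat f_\lambda,\mathbb{S}_\lambda)]$ for each $\lambda\in\Lambda$.
If $\lambda=L$, by using (12) and the fact that $\hat f_L\in S_{\{1,\ldots,M\}}$, we have

\[ C'\,\mathbb{E}\big[A(\hat f_L,\mathbb{S}_L)\big]\le\mathbb{E}\big[\|f-\hat f_L\|^2\big]+M\sigma^2. \]

If $\lambda=MS$, we may use (13) since with probability one $\hat f_{MS}\in\hat S_{MS}\in\mathbb{S}_{MS}$ and since
$\dim(S)\vee\Delta(S)\le1+\log(M)$ for all $S\in\mathbb{S}_{MS}$, we get

\[ C'\,\mathbb{E}\big[A(\hat f_{MS},\mathbb{S}_{MS})\big]\le\mathbb{E}\big[\|f-\hat f_{MS}\|^2\big]+\log(M)\,\sigma^2. \]

Finally, let us turn to the case $\lambda=Cv$ and denote by $g$ the best approximation
of $f$ in $\mathcal{C}$. Since $\hat f_{Cv}\in\mathcal{C}$, for all $S\in\mathbb{S}_{Cv}$,

\[
\begin{aligned}
\|\hat f_{Cv}-\Pi_S\hat f_{Cv}\|&\le\|\hat f_{Cv}-\Pi_Sg\|\\
&\le\|\hat f_{Cv}-f\|+\|f-g\|+\|g-\Pi_Sg\|\\
&\le2\,\|f-\hat f_{Cv}\|+\|g-\Pi_Sg\|,
\end{aligned}
\]

and hence by using (12)

\[ C'\,\mathbb{E}\big[A(\hat f_{Cv},\mathbb{S}_{Cv})\big]\le\mathbb{E}\big[\|f-\hat f_{Cv}\|^2\big]+A(g,\mathbb{S}_{Cv}) \]

where $A(g,\mathbb{S}_{Cv})$ is given by (36). By arguing as in Section 3.1.3, we deduce
that under (24)

\[ C'\,\mathbb{E}\big[A(\hat f_{Cv},\mathbb{S}_{Cv})\big]\le\mathbb{E}\big[\|f-\hat f_{Cv}\|^2\big]+\sqrt{nL^2\log\big(eM/\sqrt{nL^2}\big)}\;\sigma^2. \]

By putting these bounds together we get the result.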
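
7.5. Proof of Corollary 2

Since Assumptions 1 to 4 are fulfilled and $F$ is finite, we may apply Theorem 1
and take $\delta=0$. By using (12), we have for some $C$ depending on $K$, $\alpha$ and $\kappa$,

\[ C\,\mathbb{E}\big[\|f-\hat f_{\hat\lambda}\|^2\big]\le\inf_{\lambda\in\Lambda}\Big\{\mathbb{E}\big[\|f-\hat f_\lambda\|^2\big]+\mathbb{E}\big[\|\hat f_\lambda-\Pi_{S_\lambda}\hat f_\lambda\|^2\big]+a\,(1+\dim(S_\lambda))\,\sigma^2\Big\}. \]

For all $\lambda\in\Lambda$,

\[
\begin{aligned}
\mathbb{E}\big[\|f-\hat f_\lambda\|^2\big]&=\|f-A_\lambda f\|^2+\mathbb{E}\big[\|A_\lambda\varepsilon\|^2\big]\\
&=\|f-A_\lambda f\|^2+\mathrm{Tr}(A_\lambda^*A_\lambda)\,\sigma^2\\
&\ge\max\big\{\|f-A_\lambda f\|^2,\ \mathrm{Tr}(A_\lambda^*A_\lambda)\,\sigma^2\big\}
\end{aligned}
\]

and

\[
\begin{aligned}
\mathbb{E}\big[\|\hat f_\lambda-\Pi_{S_\lambda}\hat f_\lambda\|^2\big]&=\|(I-\Pi_{S_\lambda})A_\lambda f\|^2+\mathbb{E}\big[\|(I-\Pi_{S_\lambda})A_\lambda\varepsilon\|^2\big]\\
&\le2\max\big\{\|(I-\Pi_{S_\lambda})A_\lambda f\|^2,\ \mathbb{E}\big[\|A_\lambda\varepsilon\|^2\big]\big\}\\
&=2\max\big\{\|(I-\Pi_{S_\lambda})A_\lambda f\|^2,\ \mathrm{Tr}(A_\lambda^*A_\lambda)\,\sigma^2\big\}
\end{aligned}
\]

and hence, Corollary 2 follows from the next lemma.

Lemma 2. For all $\lambda\in\Lambda$ we have

(i) $\|(I-\Pi_{S_\lambda})A_\lambda f\|\le\|f-A_\lambda f\|$,
(ii) $\dim(S_\lambda)\le4\,\mathrm{Tr}(A_\lambda^*A_\lambda)$.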

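Proof of Lemma 2: Writing $f=f_0+f_1\in\ker(A_\lambda)\oplus\mathrm{rg}(A_\lambda^*)$ and using the
fact that $\mathrm{rg}(A_\lambda^*)=\ker(A_\lambda)^\perp$ and the definition of $\Pi_\lambda$, we obtain

\[
\begin{aligned}
\|f-A_\lambda f\|^2&=\|f_0+f_1-A_\lambda f_1\|^2\\
&=\|f_0-\Pi_{\ker(A_\lambda)}A_\lambda f_1\|^2+\|(A_\lambda^+-\Pi_\lambda)A_\lambda f_1\|^2\\
&\ge\|(A_\lambda^+-\Pi_\lambda)A_\lambda f_1\|^2\\
&\ge\sum_{k=1}^{m_\lambda}s_k^2\,\langle A_\lambda f,v_k\rangle^2
\end{aligned}
\]

where $s_1\ge\ldots\ge s_{m_\lambda}$ are the singular values of $A_\lambda^+-\Pi_\lambda$ counted with
their multiplicity and $(v_1,\ldots,v_{m_\lambda})$ is an orthonormal family of right-singular
vectors associated to $(s_1,\ldots,s_{m_\lambda})$. If $s_1<1$, then $S_\lambda=\mathbb{R}^n$ and we have
$\|f-A_\lambda f\|\ge\|(I-\Pi_{S_\lambda})A_\lambda f\|=0$. Otherwise, $s_1\ge1$, we may consider $k_\lambda$
as the largest $k$ such that $s_k\ge1$ and derive that

\[ \sum_{k=1}^{m_\lambda}s_k^2\,\langle A_\lambda f,v_k\rangle^2\ge\sum_{k=1}^{k_\lambda}\langle A_\lambda f,v_k\rangle^2=\|(I-\Pi_{S_\lambda})A_\lambda f\|^2, \]

which proves the assertion (i).
For the bound (ii), we set $M_\lambda=A_\lambda^+-\Pi_\lambda$ and note that

\[ (M_\lambda-\Pi_\lambda)(M_\lambda-\Pi_\lambda)^*=M_\lambda M_\lambda^*+\Pi_\lambda\Pi_\lambda^*-M_\lambda\Pi_\lambda^*-\Pi_\lambda M_\lambda^* \]

induces a semi-positive quadratic form on $\mathrm{rg}(A_\lambda^*)$. As a consequence the
quadratic form $(M_\lambda+\Pi_\lambda)(M_\lambda+\Pi_\lambda)^*$ is dominated by the quadratic form
$2(M_\lambda M_\lambda^*+\Pi_\lambda\Pi_\lambda^*)$ on $\mathrm{rg}(A_\lambda^*)$. Furthermore

\[ (M_\lambda+\Pi_\lambda)(M_\lambda+\Pi_\lambda)^*=(A_\lambda^+)(A_\lambda^+)^*=(A_\lambda^*A_\lambda)^+ \]

where $(A_\lambda^*A_\lambda)^+$ is the inverse of the linear operator $L_\lambda:\mathrm{rg}(A_\lambda^*)\to\mathrm{rg}(A_\lambda^*)$
induced by $A_\lambda^*A_\lambda$ restricted on $\mathrm{rg}(A_\lambda^*)$. We then have that the quadratic form
induced by $(A_\lambda^*A_\lambda)^+$ is dominated by the quadratic form

\[ 2(A_\lambda^+-\Pi_\lambda)(A_\lambda^+-\Pi_\lambda)^*+2\Pi_\lambda\Pi_\lambda^* \]

on $\mathrm{rg}(A_\lambda^*)$. In particular the sequence of the eigenvalues of $(A_\lambda^*A_\lambda)^+$ is
dominated by the sequence $(2s_k^2+2)_{k=1,\ldots,m_\lambda}$ so

\[ \mathrm{Tr}(A_\lambda^*A_\lambda)\ge\sum_{k=1}^{m_\lambda}\frac{1}{2s_k^2+2}\ge\sum_{k=k_\lambda+1}^{m_\lambda}\frac14=\frac{\dim(S_\lambda)}{4}, \]

which concludes the proof of Lemma 2.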
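
Both assertions of Lemma 2 can be checked numerically in the non-singular case. The sketch below assumes that, for an invertible $A_\lambda$, the space $S_\lambda$ is the span of the right-singular vectors of $A_\lambda^{-1}-I$ associated to singular values smaller than 1 (our reading of the definition in Section 4.1); the random operators are built with singular values in [0.2, 2] purely to keep the inversion well conditioned.

```python
import numpy as np

def random_operator(rng, n, low=0.2, high=2.0):
    """Random invertible matrix U diag(d) V^T with singular values d in [low, high]."""
    U, _ = np.linalg.qr(rng.standard_normal((n, n)))
    V, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return U @ np.diag(rng.uniform(low, high, size=n)) @ V.T

def check_lemma2(A, f, tol=1e-9):
    """Return (assertion (i) holds, assertion (ii) holds) for an invertible A."""
    n = A.shape[0]
    M = np.linalg.inv(A) - np.eye(n)
    _, s, Vt = np.linalg.svd(M)
    small = s < 1.0
    V_small = Vt[small].T                       # orthonormal basis of S_lambda
    Af = A @ f
    resid = Af - V_small @ (V_small.T @ Af)     # (I - Pi_S) A f
    holds_i = np.linalg.norm(resid) <= np.linalg.norm(f - Af) + tol
    holds_ii = int(small.sum()) <= 4.0 * np.trace(A.T @ A) + tol
    return bool(holds_i), bool(holds_ii)

rng = np.random.default_rng(2)
results = [check_lemma2(random_operator(rng, 6), rng.standard_normal(6))
           for _ in range(200)]
```

Every pair in `results` should be `(True, True)`; the check relies on the identity $f-Af=(A^{-1}-I)Af$, which is what makes the singular values of $A^{-1}-I$ the relevant quantities.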
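
7.6. Proof of Corollary 4

Along the section, we write $S_*$ for $S_{m^*}$ and $S_\lambda$ for $S_{\hat m(\lambda)}$ for short. By
using (10) with $\delta=0$ and since $\Sigma\le1+\log(1+p)$, we have

\[ C\,\mathbb{E}\big[\|f-\hat f_{\hat\lambda}\|^2\big]\le\mathbb{E}\Big[\inf_{\lambda\in\Lambda}\Big\{\|f-\Pi_{S_\lambda}Y\|^2+\mathrm{pen}(S_\lambda)\,\hat\sigma^2_{S_\lambda}\Big\}\Big]+(1+\log(p+1))\,\sigma^2 \]

for some constant $C>0$ depending on $K$ only. Writing $B$ for the event
$B=\{m^*\notin\widehat{\mathcal{M}}\}$, we have

\[ \mathbb{E}\Big[\inf_{\lambda\in\Lambda}\Big\{\|f-\Pi_{S_\lambda}Y\|^2+\mathrm{pen}(S_\lambda)\,\hat\sigma^2_{S_\lambda}\Big\}\Big]\le A_n+R_n \]

where

\[ A_n=\mathbb{E}\Big[\|f-\Pi_{S_*}Y\|^2+\mathrm{pen}(S_*)\,\hat\sigma^2_{S_*}\Big],\qquad
R_n=\mathbb{E}\Big[\inf_{\lambda\in\Lambda}\Big\{\|f-\Pi_{S_\lambda}Y\|^2+\mathrm{pen}(S_\lambda)\,\hat\sigma^2_{S_\lambda}\Big\}\mathbf{1}_B\Big]. \]

Let us bound $A_n$ from above. Note that $\|f-\Pi_{S_*}Y\|^2=\|\Pi_{S_*}\varepsilon\|^2$ and $\hat\sigma^2_{S_*}=
\|(I-\Pi_{S_*})\varepsilon\|^2/(n-\dim(S_*))$ and since $\dim(S_*)\le D_{\max}\le\kappa n/(2\log p)$, by
using (4) we get

\[ A_n\le(\dim(S_*)+\mathrm{pen}(S_*))\,\sigma^2\le C'(1+\log(p))\dim(S_*)\,\sigma^2, \]

for some constant $C'>0$ depending on $K$ and $\kappa$ only.

Let us now turn to $R_n$. For all $\lambda\in\Lambda$, $\|f-\Pi_{S_\lambda}Y\|^2\le\|f\|^2+\|\varepsilon\|^2$ and

\[ \hat\sigma^2_{S_\lambda}=\frac{\|Y-\Pi_{S_\lambda}Y\|^2}{n-\dim(S_\lambda)}\le2\,\frac{\|f\|^2+\|\varepsilon\|^2}{n-\dim(S_\lambda)}. \]

Since for all $S\in\mathbb{S}$, $\dim(S)\le D_{\max}\le\kappa n/(2\log p)$, by using (4) again, there
exists some positive constant $c$ depending on $K$ and $\kappa$ only such that for all
$\lambda\in\Lambda$, $\mathrm{pen}(S_\lambda)/(n-\dim(S_\lambda))\le c$ and hence,

\[ \inf_{\lambda\in\Lambda}\Big\{\|f-\Pi_{S_\lambda}Y\|^2+\mathrm{pen}(S_\lambda)\,\hat\sigma^2_{S_\lambda}\Big\}\mathbf{1}_B\le(1+2c)\big(\|f\|^2+\|\varepsilon\|^2\big)\mathbf{1}_B. \]

Some calculation shows that $\mathbb{E}\big[(\|f\|^2+\|\varepsilon\|^2)^2\big]\le(\|f\|^2+2n\sigma^2)^2$ and hence,
by Cauchy-Schwarz inequality

\[ R_n\le(1+2c)\big(\|f\|^2+2n\sigma^2\big)\sqrt{\mathbb{P}(B)}. \]

The result follows by putting the bounds on $A_n$ and $R_n$ together.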

8. Appendix

8.1. Computation of $\mathrm{pen}_\Delta(S)$

The penalty $\mathrm{pen}_\Delta(S)$, defined at equation (3), is linked to the EDkhi function
introduced in Baraud et al. (2009) (see Definition 3), via the following formula:

Therefore, according to the result given in Section 6.1 in Baraud et al. (2009),
$\mathrm{pen}_\Delta(S)$ is the solution in $x$ of the equation

P Fd+z,n-i > X- 



D+l V ^ ^ ^P + 3) 



-x ,,,^ , ,, P Fd+i,n+i > X- 



N{D + l) V "^''''^' - N{D + 1) 

8.2. Simulated examples

The collection $\mathcal{E}$ is composed of several collections $\mathcal{E}_1,\ldots,\mathcal{E}_{11}$ that are
detailed below. The collections $\mathcal{E}_1$ to $\mathcal{E}_{10}$ are composed of examples where $X$ is
generated as n independent centered Gaussian vectors with covariance
matrix $C$. For each $e\in\{1,\ldots,10\}$, we define a $p\times p$ matrix $C_e$ and a p-vector of
parameters $\beta_e$. We denote by $\mathcal{X}_e$ the set of 5 matrices $X$ simulated as n i.i.d.
$\mathcal{N}_p(0,C_e)$. The collection $\mathcal{E}_e$ is then defined as follows:

\[ \mathcal{E}_e=\big\{ex(n,p,X,\beta,\rho),\ (n,p)\in\mathcal{I},\ X\in\mathcal{X}_e,\ \beta=\beta_e,\ \rho\in\mathcal{R}\big\} \]

where $\mathcal{R}=\{5,10,20\}$ and

\[ \mathcal{I}=\{(100,50),(100,100),(100,1000),(200,100),(200,200)\} \quad (38) \]

in Section 6.2, and

\[ \mathcal{I}=\{(100,50),(100,100),(200,100),(200,200)\} \quad (39) \]

in Section 6.3.
Let us now describe the collections $\mathcal{E}_1$ to $\mathcal{E}_{10}$.

Collection $\mathcal{E}_1$. The matrix $C$ equals the $p\times p$ identity matrix denoted $I_p$.
The parameters $\beta$ satisfy $\beta_j=0$ for $j\ge16$, $\beta_j=2.5$ for $1\le j\le5$, $\beta_j=1.5$
for $6\le j\le10$, $\beta_j=0.5$ for $11\le j\le15$.

Collection $\mathcal{E}_2$. The matrix $C$ is such that $C_{j,k}=r^{|j-k|}$ for $1\le j,k\le15$ and
$16\le j,k\le p$ with $r=0.5$. Otherwise $C_{j,k}=0$. The parameters $\beta$ are as in
Collection $\mathcal{E}_1$.

Collection $\mathcal{E}_3$. The matrix $C$ is as in Collection $\mathcal{E}_2$ with $r=0.95$, the
parameters $\beta$ are as in Collection $\mathcal{E}_1$.

Collection $\mathcal{E}_4$. The matrix $C$ is such that $C_{j,k}=r^{|j-k|}$ for $1\le j,k\le p$, with
$r=0.5$, the parameters $\beta$ are as in Collection $\mathcal{E}_1$.

Collection $\mathcal{E}_5$. The matrix $C$ is as in Collection $\mathcal{E}_4$, with $r=0.95$, the
parameters $\beta$ are as in Collection $\mathcal{E}_1$.

Collection $\mathcal{E}_6$. The matrix $C$ equals $I_p$. The parameters $\beta$ satisfy $\beta_j=0$ for
$j\ge16$, $\beta_j=1.5$ for $j\le15$.
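
As an illustration of how the first collections can be generated, the sketch below builds the covariance matrices of Collections $\mathcal{E}_2$ to $\mathcal{E}_5$, the parameter vector of Collection $\mathcal{E}_1$, and one design matrix. The function names are ours, indices are 0-based, and the original study used R rather than Python.

```python
import numpy as np

def ar1_cov(p, r):
    """Matrix with entries r^{|j-k|} (Collections 4 and 5)."""
    idx = np.arange(p)
    return r ** np.abs(idx[:, None] - idx[None, :])

def block_ar1_cov(p, r, split=15):
    """r^{|j-k|} within the first 15 and within the remaining predictors, 0 across (Collections 2 and 3)."""
    C = np.zeros((p, p))
    C[:split, :split] = ar1_cov(split, r)
    C[split:, split:] = ar1_cov(p - split, r)
    return C

def beta_collection1(p):
    """beta_j = 2.5, 1.5, 0.5 on three consecutive groups of five predictors, 0 afterwards."""
    beta = np.zeros(p)
    beta[0:5], beta[5:10], beta[10:15] = 2.5, 1.5, 0.5
    return beta

def draw_design(n, C, seed=0):
    """n i.i.d. rows drawn from N_p(0, C)."""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(np.zeros(C.shape[0]), C, size=n)

C2 = block_ar1_cov(50, 0.5)
X = draw_design(100, C2)
beta = beta_collection1(50)
```

Replacing `0.5` by `0.95` gives the strongly correlated variants, and `ar1_cov(p, r)` alone gives the covariance used when the correlation runs across all predictors.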
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Collection £j The matrix C satisfies Cj^k = (1 — Pi)^j=k + Pi for 1 <,j,k < 
3, Cj^k = Ckj = P2 for j = 4, /c = 1,2,3, Cj^k = li=fc for j,k > 5, with 
pi = .39 and p2 = .23. The parameters /3 satisfy f3j = for j > 4, /3j = 5.6 
for j < 3. 

Collection £s The matrix C satisfies Cj^k = 0.5'-'"'^' for j, k < 8, Cj,k = %=fc 
for j,k > 9. The parameters /3 satisfy /3j = for j ^ {1,2,5}, /3i = 3, 
/32 = 1.5, /3, = 2. 

Collection Sg The matrix C is defined as in Example Sg- The parameters 
/5 satisfy /3j = for j > 9, l3j = 0.85 for j < 8. 

Collection Sio The matrix C satisfies Cj^a,- = O.blj^Lk + llj=fc for j, k < 40, 
Cj^fe = lj=k for j, /c > 41. The parameters /3 satisfy /3j = 2 for 11 < j < 20 
and 31 < j < 40, /3j = otherwise. 

Collection Su In this last example, we denote by Xu the set of 5 matrices 
X simulated as follows. For 1 < j < p, we denote by Xj the column j of X. 
Let E be generated as n i.i.d. A/^(0, O.Ol/p) and let Zi, Z2, Z^ be generated 
as n i.i.d. Ms^O, I3). Then for j = 1, . . . , 5, Xj = Zi + Ej, for j = 6, . . . , 10, 
Xj = Z2 + Ej, for J = 11, ... , 15, Xj = Z3 + Ej, for j > 16, Xj = Ej. The 
parameters /3 are as in Collection Sq. The collection £^11 is defined as the set 
of examples ex{n,p, X, (3, p) for {n,p) G X, X 6 Xu, and p ElZ. 

The collection $\mathcal{E}$ is thus composed of 660 examples for $X$ chosen as in (39),
and 825 for $X$ chosen as in (38). For some of the examples, the Lasso
estimators were highly biased, leading to high values of the ratio $O_{\mathrm{ex}}/n\sigma^2$, see
Equation (28). We only keep the examples for which the Lasso estimator
improves the risk of the naive estimator $Y$ by a factor at least 1/3. This
convention leads us to remove 171 examples out of 825. These pathological
examples come from the collections $\mathcal{E}_1$, $\mathcal{E}_6$ and $\mathcal{E}_7$ for $n = 100$ and
$p \ge 100$, and from collections $\mathcal{E}_2$ and $\mathcal{E}_4$ when $p = 1000$. The examples of
collection $\mathcal{E}_7$ were chosen by Zou to illustrate that the Lasso estimators may
be highly biased. All the other examples correspond to matrices $X$ that are
nearly orthogonal.

8.3. Procedures for calculating sets of predictors

Let $\mathcal{M} = \bigcup_{\ell \in \mathcal{L}} \mathcal{M}_\ell$ where we recall that for $\ell \in \mathcal{L}$, $\mathcal{M}_\ell = \{m(\ell, h) \mid h \in H_\ell\}$.

The Lasso procedure is described in Section 6.2. The collection is
$\mathcal{M}_{\mathrm{Lasso}} = \{m(1), \dots, m(D_{\max})\}$, where $m(h)$ is the set of indices corresponding
to the predictors returned by the LARS-Lasso algorithm at step $h \in \{1, \dots, D_{\max}\}$
(see Section 6.2).

The ridge procedure is based on the minimization of $\|Y - X\beta\|^2 + h\|\beta\|^2$ with
respect to $\beta$, for some positive $h$, see for example Hoerl and Kennard (2006).
Tibshirani (1996) noted that in the case of a large number of small effects,
ridge regression gives better results than the lasso for variable selection. For
each $h \in H_{\mathrm{ridge}}$, the regression coefficients $\hat\beta(h)$ are calculated and a collection
of predictor sets is built as follows. Let $j_1, \dots, j_p$ be such that
$|\hat\beta_{j_1}(h)| \ge \dots \ge |\hat\beta_{j_p}(h)|$ and set

$M_h = \{\{j_1, \dots, j_k\},\ k = 1, \dots, D_{\max}\}$.

Then, the collection $\mathcal{M}_{\mathrm{ridge}}$ is defined as $\mathcal{M}_{\mathrm{ridge}} = \{M_h,\ h \in H_{\mathrm{ridge}}\}$.
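
The construction of the nested sets from ridge coefficients can be sketched as follows. This is a minimal illustration, assuming the closed-form ridge solution $(X^\top X + hI)^{-1}X^\top Y$ with no intercept and no standardization of the columns; the function names `ridge_coefficients` and `nested_sets` are hypothetical, and indices are 0-based, unlike in the text.

```python
import numpy as np


def ridge_coefficients(X, y, h):
    """Minimiser of ||y - X b||^2 + h ||b||^2 (closed form, no intercept)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + h * np.eye(p), X.T @ y)


def nested_sets(beta, d_max):
    """Return [{j_1}, {j_1, j_2}, ...] where j_1, j_2, ... order |beta| decreasingly."""
    order = np.argsort(-np.abs(beta), kind="stable")
    return [frozenset(int(j) for j in order[:k]) for k in range(1, d_max + 1)]


# Toy example with an orthonormal design, so that beta = y / (1 + h).
X = np.eye(3)
y = np.array([1.0, -3.0, 2.0])
beta = ridge_coefficients(X, y, h=1.0)   # [0.5, -1.5, 1.0]
M_h = nested_sets(beta, d_max=2)         # [{1}, {1, 2}] with 0-based indices
```

Repeating this for every $h$ in the grid and collecting the resulting families gives a collection of candidate predictor sets of the kind described above.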

The elastic net procedure proposed by Zou and Hastie (2005) mixes the $\ell_1$
and $\ell_2$ penalties of the Lasso and the ridge procedures. Let $H_{\mathrm{ridge}}$ be a grid
of values for the tuning parameter $h$ of the $\ell_2$ penalty. We choose
$\mathcal{M}_{\mathrm{en}} = \{M_{(\mathrm{en},h)} : h \in H_{\mathrm{ridge}}\}$ where $M_{(\mathrm{en},h)}$ denotes the collection of the active sets of
cardinality less than $D_{\max}$ selected by the elastic net procedure when the
$\ell_2$-smoothing parameter equals $h$. For each $h \in H_{\mathrm{ridge}}$ the collection $M_{(\mathrm{en},h)}$ can
be conveniently computed by first calculating the ridge regression coefficients
and then applying the LARS-lasso algorithm, see Zou and Hastie (2005).

The partial least squares regression (PLSR1) aims to reduce the dimensionality
of the regression problem by calculating a small number of components
that are useful for predicting $Y$. Several applications of this procedure for
analysing high-dimensional genomic data have been reviewed by Boulesteix
and Strimmer (2006). In particular, it can be used for calculating subsets
of covariates as we did for the ridge procedure. The PLSR1 procedure
constructs, for a given $h$, uncorrelated latent components $t_1, \dots, t_h$ that are
highly correlated with the response $Y$, see Helland (2006). Let $H_{\mathrm{pls}}$ be a grid
of values for the tuning parameter $h$. For each $h \in H_{\mathrm{pls}}$, we write $\hat\beta(h)$ for the
PLS regression coefficients calculated with the first $h$ components. We then
set $\mathcal{M}_{\mathrm{PLS}} = \{M_h : h \in H_{\mathrm{pls}}\}$, where $M_h$ is built from $\hat\beta(h)$ as for the ridge
procedure.

The adaptive lasso procedure proposed by Zou (2006) starts with a
preliminary estimator $\hat\beta$. Then one applies the lasso procedure replacing the
parameters $|\beta_j|$, $j = 1, \dots, p$ in the $\ell_1$ penalty by the weighted parameters
$|\beta_j|/|\hat\beta_j|^{\gamma}$, $j = 1, \dots, p$ for some positive $\gamma$. The idea is to increase the penalty
for coefficients that are close to zero, thus reducing the bias in the estimation
of $f$ and improving the variable selection accuracy. Zou showed that, if $\hat\beta$ is a
$\sqrt{n}$-consistent estimator of $\beta$, then the adaptive lasso procedure is consistent
in situations where the lasso is not. A lot of work has been done around this
subject, see Huang et al. (2008) for example.

We apply the procedure with $\gamma = 1$, considering two different preliminary
estimators:

- using the ridge estimator $\hat\beta(h)$ as preliminary estimator. For each $h \in H_{\mathrm{ridge}}$,
the adaptive lasso procedure is applied for calculating the active sets,
$M_{\mathrm{ALridge},h}$, of cardinality less than $D_{\max}$. The collection $\mathcal{M}_{\mathrm{ALridge}}$ is thus
defined as $\mathcal{M}_{\mathrm{ALridge}} = \{M_{\mathrm{ALridge},h},\ h \in H_{\mathrm{ridge}}\}$.

- using the PLSR1 estimator $\hat\beta(h)$ as preliminary estimator. The procedure
is the same as described just above. The collection $\mathcal{M}_{\mathrm{ALpls}}$ is defined as

$\mathcal{M}_{\mathrm{ALpls}} = \{M_{\mathrm{ALpls},h},\ h \in H_{\mathrm{pls}}\}$.
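
One common way to compute a weighted $\ell_1$ fit of this kind is to rescale column $j$ of $X$ by $|\hat\beta_j|^{\gamma}$, solve an ordinary lasso problem, and scale the solution back. The sketch below illustrates this reparametrization only; it is not the implementation used in the paper. It assumes the objective $\tfrac12\|y - X\beta\|^2 + \lambda\sum_j |\beta_j|/|\hat\beta_j|^{\gamma}$, replaces the LARS algorithm by a plain cyclic coordinate descent, and uses hypothetical function names.

```python
import numpy as np


def lasso_cd(X, y, lam, n_iter=200):
    """Minimise 0.5 * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    p = X.shape[1]
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    r = y.astype(float).copy()               # current residual y - X b
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:              # null column: coefficient stays at zero
                continue
            r += X[:, j] * b[j]               # remove coordinate j from the fit
            rho = X[:, j] @ r
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            r -= X[:, j] * b[j]               # put the updated coordinate back
    return b


def adaptive_lasso(X, y, beta_init, lam, gamma=1.0):
    """Weighted lasso with penalty weights 1 / |beta_init_j| ** gamma."""
    w = np.abs(beta_init) ** gamma            # column scaling; zero means infinite penalty
    return lasso_cd(X * w, y, lam) * w        # solve in the rescaled problem, scale back


# Toy check on an orthonormal design, using y itself as preliminary estimator.
beta_init = np.array([3.0, 0.5, -2.0, 0.1])
beta_al = adaptive_lasso(np.eye(4), beta_init.copy(), beta_init, lam=1.0)
active_set = {j for j in range(4) if beta_al[j] != 0.0}   # {0, 2}
```

In this toy case each coordinate is soft-thresholded at level $\lambda/|\hat\beta_j|$, so the two small coefficients are set to zero while the large ones are shrunk only slightly, which is the behaviour the weighting is meant to produce.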

The random forest algorithm was proposed by Breiman (2001) for classi- 
fication and regression problems. The procedure averages several regression 
trees calculated on bootstrap samples. The algorithm returns measures of 
variable importance that may be used for variable selection, see for example 
Diaz-Uriarte and Alvares de Andres (2006), Genuer et al. (2010), Strobl et 
al. (2007; 2008). 

Let us denote by $h$ the number of variables randomly chosen at each split
when constructing the trees and

$H_{\mathrm{RF}} = \{p/j \mid j \in \{3, 2, 1.5, 1\}\}$.

For each $h \in H_{\mathrm{RF}}$ we consider the set of indices

$M_h = \{\{j_1, \dots, j_k\},\ k = 1, \dots, D_{\max}\}$,

where $\{j_1, \dots, j_k\}$ are the ranks of the variable importance measures. Two
importance measures are proposed. The first one is based on the decrease in
the mean square error of prediction after permutation of each of the variables.
It leads to the collection $\mathcal{M}_{\mathrm{RFmse}} = \{M_h,\ h \in H_{\mathrm{RF}}\}$. The second one is
based on the decrease in node impurities, and leads similarly to the collection
$\mathcal{M}_{\mathrm{RFpurity}}$.

The exhaustive procedure considers the collection of all subsets of $\{1, \dots, p\}$
with dimension smaller than $D_{\max}$. We denote this collection $\mathcal{M}_{\mathrm{exhaustive}}$.

Choice of tuning parameters. We have to choose $D_{\max}$, the largest number
of predictors considered in the collection $\mathcal{M}$. For all methods, except the
exhaustive method, $D_{\max}$ may be large, say $D_{\max} \le \min(n - 2, p)$.
Nevertheless, to save computing time, we chose $D_{\max}$ large enough that
the dimension of the estimated subset is always smaller than $D_{\max}$. For the
exhaustive method, $D_{\max}$ must be chosen in order to make the calculation
feasible: $D_{\max} = 4$ for $p = 50$, $D_{\max} = 3$ for $p = 100$ and $D_{\max} = 2$ for
$p = 200$.
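
To see why $D_{\max}$ has to be this small for the exhaustive method, one can count the candidate subsets. The short sketch below does so, assuming that the collection contains all non-empty subsets of size at most $D_{\max}$ (our reading of "dimension smaller than $D_{\max}$"); the helper name `n_subsets` is hypothetical.

```python
from math import comb


def n_subsets(p, d_max):
    """Number of non-empty subsets of {1, ..., p} with at most d_max elements."""
    return sum(comb(p, k) for k in range(1, d_max + 1))


counts = {
    (50, 4): n_subsets(50, 4),     # 251175
    (100, 3): n_subsets(100, 3),   # 166750
    (200, 2): n_subsets(200, 2),   # 20100
}
```

Under this reading, the three settings keep the number of least-squares fits per data set between roughly $2 \times 10^4$ and $2.5 \times 10^5$, whereas allowing one more predictor for $p = 50$ would already add more than two million subsets.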

For the ridge method we choose $H_{\mathrm{ridge}} = \{10^{-3}, 10^{-2}, 10^{-1}, 1, 5\}$, and for the
PLSR1 method, $H_{\mathrm{pls}} = \{1, \dots, 5\}$.